ABSTRACT


Salinas, California has been battling an above-average crime rate for over 30
years. This is due primarily to two rival gangs in Salinas: the Norteños and the
Sureños. The city and the surrounding community have implemented many
methods to mitigate the crime level, from community involvement to the inception
of a gang task force. As of yet, none of the efforts have had long-lasting effects.


In a 2009 thesis, Jason A. Clarke and Tracy L. Onufer postulated that
various socio-economic variables influence the crime level in Salinas.
They characterized “crime” as the sum of reported homicides, assaults and
robberies. Their thesis determined that “to lower overall violence levels, officials in
Salinas should focus on: reducing the unemployment rate, the number of vacant
housing units, and the high school dropout rate; and increasing the high school
graduation rate and average daily attendance.”


A deeper examination of the data could lead not only to hypotheses
about how to lower crime rates, but also to a means of predicting future crime
rates by using various methods of multi-variable regression.
I. INTRODUCTION


The Salinas Police Department (SPD), in conjunction with the community
leadership in Salinas, has been working tirelessly to mitigate gang-related crime.
Numerous efforts are currently in practice to reduce the city crime rate, from
community involvement to the creation of a gang task force in association with the
surrounding county police offices. All of these efforts are based on experience
and on practices seen in other cities; none relies on a statistical model to predict
future levels of violent crime in Salinas. This study’s purpose is to give Salinas a
tool to predict crime using socio-economic statistics easily attainable from public
sources.


A. PREVIOUS RESEARCH 


In December 2009, Jason Clarke and Tracy Onufer completed an NPS
thesis entitled “Understanding Environmental Factors that Affect Violence in
Salinas, California” (Onufer & Clarke, 2009). Their research compared nine
environmental factors (economy, population, housing, education, police force,
prison influence, gang rivalry, social service programs, and community
involvement) against the yearly violence rate in Salinas to determine which
environmental factors, if any, are correlated with the violence levels in Salinas.
Clarke and Onufer considered violence a combination of reported homicides,
robberies, and assaults.


The recommendations resulting from Clarke and Onufer’s research
followed Mayor Dennis Donahue’s “four-fold [strategy]: prevention,
intervention, a newly envisioned and expanded police department and enhanced
community engagement and mobilization” (Stahl, 2009, para. 51). Clarke and
Onufer showed that violence was highly correlated with education and dropout
rate. This led Clarke and Onufer to conjecture that with an increased emphasis
on education and prevention, violence rates would decrease. Intervention was
to be established through vocational, education, counseling, and
rehabilitation programs. Clarke and Onufer suggested opening a police
substation in the center of the gang territory as a police expansion strategy. Their
final recommendation was to start a Mayor's Gang Prevention Task Force
(MGPTF) like San Jose's to enhance community engagement and mobilization.


Clarke and Onufer’s correlation analysis led to the conclusion that
four environmental factors are highly correlated with the Salinas violence rate:
unemployment rate, number of vacant housing units, high school dropout rate,
and daily school attendance rate.


B. RESEARCH OBJECTIVE 


Using previously established environmental variables, multi-variable
regression models were created to predict future violence levels using statistical
analysis techniques. A program was also written to automate the regression,
allowing further exploration and analysis of the Salinas environmental data.


C. BACKGROUND 
1. History of Violence in Salinas 


Small tribes of Native Americans inhabited the area that is now the City of
Salinas until around 1822. In 1822, Mexico gained independence from Spain and
outside settlers began to arrive in Salinas. From the 1820s to the 1890s, the
Salinas Valley was used primarily for ranching and for growing wheat and barley.
After the 1890s, advances in irrigation and agricultural practices introduced the
sugar beet industry to Salinas. In the 1920s, sugar beets and beans gave way to the
farming of lettuce because the ice-bunkered railroad allowed fresh produce to be
shipped nationwide. The area continues to grow lettuce and other green vegetables
to this day (Seavey, 2010).


The success of the farming industry earned Salinas the nickname “The
Salad Bowl of the World,” fueling a “$2 billion agriculture industry
which supplies 80% of the country's lettuce and artichokes, along with many
other crops” (History of Salinas, 2012, para. 3). Every year, thousands of migrant
workers travel to Salinas from Mexico to work on the farms during the harvest
season.


The population and the racial demographics of Salinas changed greatly
from 1980 to 2011. The population of Salinas was 80,479 in 1980 (Chapman,
1982, p. 18) and had increased to 150,441 by the 2010 census (State & County
QuickFacts: Salinas, California, 2012). Salinas was 38.1% Hispanic in 1980; that
share had increased to 75.0% by the 2010 census (McFarlane, 2012).


The Salinas gang problem can be traced back to the 1950s. In his 2009
90-day report, Police Chief Fetherolf quoted a statement by 1950 Police Chief
McIntyre: “Gang fights will not be tolerated in the City of Salinas” (Fetherolf,
2009, p. 6). The 90-day report goes on to mention various instances of violence in
Salinas prior to 2009. As displayed in Figures 1-3, crime in
Salinas has steadily increased and remained above the national
average over the past two decades. In 2009, Salinas had a record-breaking 29
homicides, about four times the national average, followed by 19 homicides in
2010, again about four times the national average. In 2009, Salinas was ranked
4th in California for homicides per capita (Fetherolf, 2010, p. 2). Figures 1, 2 and
3 compare Salinas homicides, robberies and assaults, respectively,
to national averages. In this study, homicides, assaults, and robberies were added
together, used as one statistic, and labeled as violence.
Figure 1. Salinas Homicides versus U.S. Average Homicides 1980-2010
Figure 2. Salinas Robberies versus U.S. Average Robberies 1980-2010
Figure 3. Salinas Assaults versus U.S. Average Assaults 1980-2010


Much of the violence in Salinas is gang-related, stemming from two
feuding gangs, the Norteños and the Sureños. Members of these two gangs often
live in Salinas physically separated by only one or two city blocks. Both of the
gangs vie for control of the drug and prostitution trade within the city limits.


The Sureños, Spanish for “Southerners,” are a Hispanic gang originating
from “a prison dispute between the Mexican Mafia (La Eme) and Nuestra Familia
(NF)” (Sureños, 2005). The original members of the gang were associated with
the urban Hispanic population, distinguishing themselves from the rural
farm-working Hispanics. Since its inception, however, the Sureños have turned into
one of the largest gangs in the United States. Sureños have migrated from
California over the past decade and are now living and active in almost every
state in the country (Morales et al., 2008, p. 8).


The Norteños, Spanish for “Northerners,” are associated with the Nuestra
Familia (Spanish for “Our Family”) gang. The Norteños are rumored to have started
their rivalry with the Sureños because a member of the Sureños stole a pair of
shoes from a Norteños member in prison. This incident started the conflict between
the Sureños and the Norteños that continues today (Hennessey, 2003). Norteños are
most widely linked with the rural Hispanics in Northern California.


Bakersfield, California is widely accepted as the division point between
Northern and Southern California gangs. However, the gangs often ignore
traditional boundary lines because of familial ties or gang-related opportunities.
Salinas is ideally situated to accommodate both gangs because of its proximity to
large agricultural and mid-sized urbanized areas. Drug trafficking through Salinas
is also very common due to its location between San Francisco and Los Angeles.


Chief Fetherolf's 2009 90-day report identifies 11 factors that contribute to
the Salinas gang-crime problem:

• The close proximity of two state prisons to the city (incarcerated
  gangsters directing gangster activities outside of the prisons)
• The effects of poverty, exacerbated by a sagging economy
• Dysfunctional or struggling families, providing too little juvenile
  supervision
• Lack of positive adult male role models
• Multi-generational gang families
• Lack of effective teacher/student attachments with at-risk youth
• Inadequate education and an elevated high school drop-out rate
• Drug sales money enticing youth into gang affiliation
• Increased violence in media and video games, desensitizing youth to
  the impacts of violence
• Limited opportunities for after-school recreation
• Migrating gangsters infiltrating and victimizing law abiding, hardworking
  seasonal farm workers, many of whom are fearful victims of unreported
  crime (Fetherolf, 2009, pp. 7-8).




The close proximity of two different prisons plays a key role in
perpetuating gang-related violence in Salinas. Salinas Valley State Prison and
the Correctional Training Facility are located about 25 miles southeast of Salinas
outside of Soledad. The California prison system was running at 200% of capacity
as of 2010. The California Department of Corrections and
Rehabilitation (CDCR) is taking steps to reduce this overcrowding. Some of these
steps include placing inmates out of state, non-revocable parole, and
inmate population reduction (Actions CDCR Has Taken to Reduce
Overcrowding, 2012).


A report by the RAND Corporation in its Record on Research about Criminal
Behavior (2009) estimates that a prisoner will commit an average of 13 crimes
after being released early from prison. The current court-ordered target for the
California prison system is 147% of capacity by November 2012. This is a reduction
from 168,830 inmates to 117,000 inmates in a one-year period (Burke & Cavanaugh,
2011), which equates to around 50,000 inmates released. With each early release
potentially committing crimes, this could result in as many as 500,000 crimes in
California. The early release program sends inmates back into the community
in which they were arrested, so Salinas could see a share of this increase
in crime.


The recidivism rate in California as of 2010 is 67.5% within three years of
release (Cate, 2010, p. 32). Many of the gangs in America, including the
Norteños and the Sureños, have strong ties to the prison system and are still
primarily run by leadership that is incarcerated.


2. Current and Past Efforts to Reduce Violence in Salinas 


In 1995, the Clinton Administration “awarded the Salinas Police
Department nearly $1 million as part of the COPS [Community Oriented Policing
Service] Youth Firearms Violence Initiative” (Success Stories, 2011). The Salinas
Police Department used the money to create a permanent anti-gang task force.
During this same period, the Salinas Police Department also created a “Violence
Suppression Unit (VSU) to take firearms away from youth and gang members”
(Success Stories, 2011). Salinas also instituted “Peace Builders,” which
was to encourage non-violent behavior among elementary school-aged children.
Finally, Salinas instituted a 20-city block clean-up program to remove clutter
and garbage from the streets. These efforts did result in a 50% decrease in
homicides from 1995 to 1996 and a slight reduction in robberies and assaults.
However, by 2002, the number of homicides was back up to a record high of 20.


In 2010, Salinas instituted Operation Ceasefire and Operation Knockout.
Operation Ceasefire was a program successfully instituted in 1996 in Boston,
Massachusetts. Operation Ceasefire was a “direct law enforcement attack effort
on illicit firearms traffickers supplying youths with guns and an attempt to
generate a strong deterrent to gang violence” (Record on Research about
Criminal Behavior Corrected, 2009). Local law enforcement made it clear that
there would be zero tolerance for gang-related activity. When officers
received reports of gang-related activity, the Boston Police Department held
gang crackdowns, arresting gang offenders. The result of the policy was that
“[g]ang violence in Boston declined abruptly” and “it was unnecessary to repeat
the crackdowns or move out gradually along the gang network as originally
planned” (Record on Research about Criminal Behavior Corrected, 2009).


As of May 2010, the Salinas Police Department had held two Operation
Ceasefire call-ins, inviting community gang members to meet with personnel who
help gang members leave their gang life behind. “Those agreeing to
take part in the program are offered employment opportunities, training and
personal services – from résumé-building to tattoo removal” (Solana, 2010). The
call-ins are also useful in informing gang members of the zero tolerance for
gang-related activities in the community.


Operation Knockout was “an eight-month operation ... aimed at
apprehending members of the Norteños and Sureños gangs that turned Salinas
into a hub of murder, robbery and drug dealing” (San Francisco Citizen, 2010).
The operation was a multi-agency effort led by the California
Department of Justice’s Bureau of Narcotics Enforcement in collaboration with
the Salinas Police Department and other local agencies. The operation
culminated in 44 arrest warrants, leading to 37 arrests and the seizure of
over 50 pounds of illegal drugs and paraphernalia. Operation Knockout was also
an attempt to “cripple the gang's grip on younger gang members in the area and
make existing gang-violence intervention efforts, such as Ceasefire, more
effective” (Reynolds, 2010).


D. REGRESSION ANALYSIS TO PREDICT CRIME 


Police departments nationwide use some form of crime prediction. The
first instance of formal crime analysis was instituted by August Vollmer in the
early 1900s (Boba, 2005, p. 20). Vollmer’s method involved the “use of pin
mapping, the regular review of police reports, and the formation of patrol districts
based on crime volume” (Grassie et al., 1977). This method of crime analysis
lasted into the 1970s.


In 1968, the Omnibus Crime Control and Safe Streets Act greatly
increased awareness of the analysis of crime statistics and crime prediction. The
act authorized grants to the States to fund efforts to reduce crime rates
(P.L. 90-351). This shift of attention from crime prosecution to crime prediction
led many police departments to adopt crime analysis techniques, culminating
in the creation of the International Association of Crime Analysts in 1991 and the
implementation of Compstat, a data- and mapping-driven police management
strategy for increasing the awareness of crime analysis (Boba, 2005, p. 23).


Currently, most police agencies use some type of crime analysis in everyday
operations. In a survey of over 17,000 agencies, Mamalian and La Vigne
found that 73% of agencies use crime analysis to fulfill the Uniform Crime Report
and around 52% calculate statistical reports on criminal activity. However, out of
all of the agencies, only 13% use some type of computerized crime analysis, with
the majority of agencies preferring the more conventional pushpin maps to the more
advanced computerized techniques (Mamalian & La Vigne, 1999).
For those agencies that use some type of crime analysis, the emphasis is
on the short-term, tactical goals of the police department as opposed to the
long-term strategic uses of the data. The 2003 report by O’Shea and Nicholls states
that police officers focus first and foremost on the apprehension of
criminals and only secondarily on the “sophisticated police tactical and strategic
decision outcomes and solutions to chronic crime problems” (O’Shea & Nicholls,
2003, p. 25). The report goes on to state that the data and methods could better
serve as a deterrent tool than as an apprehension tool, and that this effort
would require cooperation between police officials and academics.


Current forecasting models are primarily built around crime mapping using
a geographic information system (GIS). This system is a computerized version of
the pushpin map model, defined as “a set of computer-based tools that allows the
user to modify, visualize, query, and analyze geographic and tabular data” (Boba,
2005, p. 37). GIS is a computerized tool to assist in departmental crime mapping.
Crime mapping uses geographical information to conduct spatial analysis of
crime problems to assist in resource allocation for police agencies (Boba, 2005,
p. 37).


A 1998 study by Diana Ehlers and Gideon Pimstone used various factors 
to predict crime rate per 100,000 in the United States. These factors included: 


• higher unemployment and increased economic deprivation
• political instability
• urbanisation patterns
• successful implementation of a crime prevention campaign which calls
  on people to report crime
• increased public awareness of crime
• improved police detection resulting in greater recording of crime
  (Ehlers, 1998).
The conclusion of their research was that statistical methods could be
used to predict crime trends. Regression analysis and correlation are useful tools
to predict crime patterns and, through these methods, policy makers can be given
a “statistical glimpse of the future” (Ehlers, 1998).


In “Forecasting Crime, a City-Level Analysis,” John V. Pepper (2007)
explored the ability of different regression models to predict crime rates. In his
study, Pepper used two primary variables to predict homicide rates: “the percent
of the population that are 18 year old males and the fraction of the population
(per 100,000) that are incarcerated” (Pepper, 2007, p. 4). His research
concentrated on linear regression models and used lag regression techniques,
which use previous homicide levels, as well as other variables in the model, to
predict future homicide levels. Pepper’s research concluded that naive walk
prediction, a method used by many police departments that bases predictions on
previously observed statistics, does well for very short-term prediction, but
regression analysis outperforms naive walk prediction for long-range
forecasting.
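
The contrast between the two forecasting approaches can be illustrated with a minimal Python sketch. It assumes that naive prediction simply carries the last observed value forward and that the lag regression uses only the previous period's level as its regressor; the series is synthetic, and the function names are hypothetical rather than taken from Pepper's study.

```python
def naive_forecast(series):
    """Carry the most recent observation forward as the next forecast."""
    return series[-1]


def fit_lag1(series):
    """Least squares fit of y_t = a + b * y_{t-1}; returns (a, b)."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    p_bar = sum(prev) / n
    c_bar = sum(curr) / n
    sxy = sum((p - p_bar) * (c - c_bar) for p, c in zip(prev, curr))
    sxx = sum((p - p_bar) ** 2 for p in prev)
    b = sxy / sxx
    a = c_bar - b * p_bar
    return a, b


def lag1_forecast(series):
    """One-step-ahead forecast from the fitted lag-1 regression."""
    a, b = fit_lag1(series)
    return a + b * series[-1]


# Synthetic series generated exactly by y_t = 2 + 0.5 * y_{t-1} (illustrative only).
series = [10.0]
for _ in range(5):
    series.append(2.0 + 0.5 * series[-1])

a, b = fit_lag1(series)
naive = naive_forecast(series)
lagged = lag1_forecast(series)
```

On a series with a real trend toward a long-run level, the two forecasts differ: the naive forecast repeats the last value, while the lag regression extrapolates the fitted relationship.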


Dr. Wayne Osgood took a different regression-based approach to
predicting aggregate crime rates. Osgood explored using Poisson and
negative-binomial regression for crime rate predictions in his 2000 study (Osgood,
2000). His research argues, “Poisson regression analysis explicitly addresses the
heterogeneous residual variance that presented a problem for [ordinary least
squares] regression analysis of crime rates” (Osgood, 2000, p. 27). Osgood then
went on to explain that negative-binomial regression may be the best method
because it does not have the problem of increased variance that
occurs in Poisson regression. This method allows for a more varied approach to
crime analysis.
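
A quick way to see why the variance issue matters is to compare the sample mean and variance of a set of counts. The sketch below is illustrative only: the counts are synthetic, and the moment estimate of the dispersion parameter assumes one common negative-binomial parameterization, in which the variance is mu + alpha * mu^2, rather than anything specified by Osgood.

```python
import statistics


def dispersion_summary(counts):
    """Return (mean, variance, variance/mean ratio, moment estimate of alpha).

    A Poisson variable has variance equal to its mean, so a ratio well above 1
    signals overdispersion. Alpha is a rough moment estimate under the assumed
    parameterization Var(y) = mu + alpha * mu**2.
    """
    mean = statistics.mean(counts)
    var = statistics.variance(counts)  # sample variance, n - 1 denominator
    ratio = var / mean
    alpha = max(0.0, (var - mean) / mean ** 2)
    return mean, var, ratio, alpha


# Synthetic yearly counts (illustrative only, not Salinas data).
counts = [0, 2, 1, 9, 0, 14, 3, 1, 0, 20]
mean, var, ratio, alpha = dispersion_summary(counts)
```

When the ratio is close to 1, Poisson regression is a reasonable starting point; a much larger ratio suggests that a negative-binomial model deserves consideration.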


Linear, Poisson, and negative-binomial regression were all used in this
study in an attempt to find the best regression tool for crime rate prediction in
Salinas. All three of the methods are discussed in detail in the next section.
II. SUMMARY OF REGRESSION


A. ORDINARY LEAST SQUARES METHOD 


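There are two different variable types associated with regression. The first
is the class of independent variables or regressors. Independent variables are
observed through research and study. The other type of variable is the
dependent variable or response variable. The purpose of regression is to model
and investigate the relationship between the dependent variable and the
independent variables. With $k$ independent variables and $n$ observations, the
multiple linear regression model is $y_i = \beta_0 + \sum_{j=1}^{k}\beta_j x_{ij} + \varepsilon_i$
for $i = 1, \ldots, n$, where the $y_i$ are independent and normally distributed with
mean $\beta_0 + \sum_{j=1}^{k}\beta_j x_{ij}$ and variance $\sigma^2$. Equivalently, the errors
$\varepsilon_1, \ldots, \varepsilon_n$ are independent and normally distributed with mean 0
and variance $\sigma^2$. The $\beta$s are estimated by minimizing the error or residual
sum of squares: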


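$$S(\beta_0, \beta_1, \ldots, \beta_k) = \sum_{i=1}^{n}\left[y_i - \left(\beta_0 + \sum_{j=1}^{k}\beta_j x_{ij}\right)\right]^2 \tag{1}$$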


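To find the minimum of (1) with respect to the $\beta$s, the derivative of the
function in (1) with respect to each of the $\beta$s is set to zero and solved. This
gives the following equations:

$$\left.\frac{\partial S}{\partial \beta_0}\right|_{\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k} = -2\sum_{i=1}^{n}\left[y_i - \left(\hat\beta_0 + \sum_{j=1}^{k}\hat\beta_j x_{ij}\right)\right] = 0 \tag{2}$$

and

$$\left.\frac{\partial S}{\partial \beta_j}\right|_{\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k} = -2\sum_{i=1}^{n}\left[y_i - \left(\hat\beta_0 + \sum_{l=1}^{k}\hat\beta_l x_{il}\right)\right]x_{ij} = 0, \quad j = 1, 2, \ldots, k. \tag{3}$$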




The $\hat\beta$s, the solutions to (2) and (3), are the least squares estimates of
the $\beta$s.


It is useful to express both the $n$ model equations and the $k+1$ equations in
(2) and (3) (which are linear functions of the $\beta$s) in matrix form. The
model can be expressed as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \tag{4}$$

where $\mathbf{y}$ is the $n \times 1$ vector of observations, $\mathbf{X}$ is an $n \times (k+1)$ matrix of
independent variables (with an extra column of 1s for the intercept $\beta_0$),
$\boldsymbol{\beta}$ is a $(k+1) \times 1$ vector of coefficients and $\boldsymbol{\varepsilon}$ is an $n \times 1$ vector of
independent and identically distributed errors.


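In order to find $\hat{\boldsymbol{\beta}}$, the $(k+1) \times 1$ vector of estimates of $\boldsymbol{\beta}$ that
minimizes the error, (1) is written in matrix form as:

$$\begin{aligned} S(\boldsymbol{\beta}) &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\ &= \mathbf{y}^T\mathbf{y} - \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} \\ &= \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} \end{aligned} \tag{5}$$

with a superscript “T” denoting the transpose of a matrix or vector. The
expression $\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y}$ is a scalar, so it equals its own transpose $\mathbf{y}^T\mathbf{X}\boldsymbol{\beta}$, which is
why the two middle terms combine. Therefore, the least-squares estimator must
satisfy the $(k+1)$ equations (2) and (3), written in matrix form as:

$$\left.\frac{\partial S}{\partial \boldsymbol{\beta}}\right|_{\hat{\boldsymbol{\beta}}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0} \tag{6}$$

where $\mathbf{0}$ is the $(k+1) \times 1$ vector of 0s. This equation can be simplified to:

$$\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y} \tag{7}$$

Under appropriate conditions (i.e., $\mathbf{X}^T\mathbf{X}$ is not singular), this formula
yields the least squares coefficients:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \tag{8}$$

These coefficients can then be used for predicting or estimating the
expected dependent variable for values of the independent variables that do not
need to be in the sample used to estimate $\boldsymbol{\beta}$.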
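
The normal equations can be illustrated with a short, dependency-free Python sketch. It is not the program written for this study: the helper names are hypothetical, the data are synthetic and generated without noise so that the known coefficients are recovered, and a small Gauss-Jordan routine stands in for the matrix inverse.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; estimates are not unique")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols(x_rows, y):
    """Least squares coefficients from the normal equations X^T X b = X^T y."""
    X = [[1.0] + list(row) for row in x_rows]  # prepend the intercept column
    Xt = transpose(X)
    XtX = [[sum(u * v for u, v in zip(r, c)) for c in Xt] for r in Xt]
    Xty = [sum(u * v for u, v in zip(r, y)) for r in Xt]
    return solve(XtX, Xty)


# Synthetic data generated exactly from y = 1 + 2*x1 + 3*x2 (illustrative only).
x_rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * a + 3 * b for a, b in x_rows]
beta_hat = ols(x_rows, y)
```

Solving the linear system directly, rather than forming the inverse explicitly, is a common design choice because it is cheaper and numerically better behaved.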
B. RELATIONSHIPS AMONG VARIABLES
1. Correlation 


In practice, there are often many candidate independent variables that can
be used in the regression equation. One of the most difficult tasks of an analyst is
to determine which of these to use. In order to determine the variables to use in a
regression, the relationship between the dependent and independent variables
must be established. The relationships among the independent variables must
also be examined. An important relationship for this study is correlation, which is
defined as the linear relationship between two variables. This relationship is
measured between pairs of observed variables. For example, in simple linear
regression with one independent variable, the observations are the
pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. The correlation between these two variables is
measured using the sample correlation coefficient with the formula:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} \tag{9}$$

with $\bar{x}$ and $\bar{y}$ representing respectively the mean of the observed independent
and dependent variables. The coefficient $r$ takes values between -1 and 1,
inclusive. A result of -1 implies a perfect negative relationship between the two
variables, wherein an increase in one variable indicates a decrease in the other.
A positive result indicates that both variables increase or decrease
together. A result near zero indicates no linear relationship, or only a very small
one, between the variables.
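
As a small illustration, the sample correlation coefficient can be computed directly from its definition. The Python sketch below uses synthetic data and a hypothetical function name; it is meant only to show the two extreme cases described above.

```python
import math


def sample_correlation(x, y):
    """Sample correlation coefficient r between two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x = [1.0, 2.0, 3.0, 4.0]
r_pos = sample_correlation(x, [2.0, 4.0, 6.0, 8.0])  # y rises with x
r_neg = sample_correlation(x, [8.0, 6.0, 4.0, 2.0])  # y falls as x rises
```

Here the first pair of variables is perfectly positively related and the second perfectly negatively related, so the two results sit at opposite ends of the range of r.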


Ideally, a good regression fit will include a dependent variable highly
correlated with the independent variables, with a correlation value between 0.5
and 1 or between -1 and -0.5. A good regression fit will also have independent
variables with very low correlation with one another, with the sample correlation
for any pair of independent variables between -0.5 and 0.5. Including highly
correlated independent variables does not add to the regression and can lead to a
non-generalizable, overfit regression model. Having highly correlated independent
variables in the regression is called multicollinearity, which can lead to difficulty
in interpretation and, when extreme, will cause $\mathbf{X}^T\mathbf{X}$ in equation (8) to be
ill-conditioned. Further, perfect linear dependence among the independent
variables will cause $\mathbf{X}^T\mathbf{X}$ to be singular, so that its inverse does not exist and
the least squares estimates of $\boldsymbol{\beta}$ in equation (8) are not uniquely defined.
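
One simple screen for multicollinearity, following the 0.5 rule of thumb above, is to flag every pair of independent variables whose sample correlation exceeds the threshold in absolute value. The Python sketch below is a hypothetical helper with synthetic columns, not the procedure used in this study.

```python
import math
from itertools import combinations


def sample_correlation(x, y):
    """Sample correlation coefficient r between two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_correlated_pairs(columns, threshold=0.5):
    """Return the names of variable pairs with |r| above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        if abs(sample_correlation(a, b)) > threshold:
            flagged.append((name_a, name_b))
    return flagged


# Synthetic independent variables: x2 is nearly a multiple of x1.
columns = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 11],
    "x3": [3, 1, 4, 1, 5],
}
flagged = flag_correlated_pairs(columns)
```

A flagged pair is a candidate for dropping or combining one of the two variables before the regression is fit.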


2. Transforming Dependent and Independent Variables 


Oftentimes, a straight line will not be the best fit of the dependent
variable as a function of the independent variables. Therefore, the variables,
either dependent or independent, must be transformed or adapted. Some
common transformations of variables are:

• Take the variable to a power
• Use the natural log function on the variable
• Invert the variable
• Multiply several variables together (interactions)

After transformation, the variables are put back into the
regression. As a standard of practice, the original variable is left in the
regression along with any of its transformations.
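
The transformations listed above can be sketched as a small Python helper that expands two independent variables into a set of candidate regressors, keeping the original variables alongside their transformations. The function and column names are hypothetical and chosen only for illustration.

```python
import math


def expand_features(x1, x2):
    """Return the original variables together with common transformations.

    Assumes x1 > 0 so that the natural log and the inverse are defined.
    """
    return {
        "x1": x1,              # original variables stay in the regression
        "x2": x2,
        "x1^2": x1 ** 2,       # power
        "ln(x1)": math.log(x1),  # natural log
        "1/x1": 1.0 / x1,      # inverse
        "x1*x2": x1 * x2,      # interaction
    }


row = expand_features(2.0, 5.0)
```

Each observation would be expanded this way before the columns are passed to the regression.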


C. VARIABLE SELECTION FOR REGRESSION 


In this section, the focus is directed to the most pressing issue of the
study, that of selecting the independent variables. Should too many variables be
added into the model, the model could be overfit and only applicable to the given
dataset. Various methods are used to test the adequacy, or goodness-of-fit, of the
regression model. The methods used in this analysis were hypothesis tests for the
regression, hypothesis tests for each of the coefficients, R-squared for the
regression, and hypothesis tests based on the analysis of variance for the
regression.


1. Hypothesis Test for Regression 


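The hypothesis test for regression can be performed for each of the
coefficients separately and for the entirety of the regression. The hypothesis test
on a single coefficient tests:

$$\begin{aligned} H_0 &: \beta_i = 0 \\ H_1 &: \beta_i \neq 0 \end{aligned} \tag{10}$$

The equation used to test this hypothesis is:

$$F = \left[\frac{\hat\beta_i}{\operatorname{se}(\hat\beta_i)}\right]^2 \tag{11}$$

where $\hat\beta_i$ is the ith coefficient to be tested and $\operatorname{se}(\hat\beta_i)$ is the standard error
of that coefficient, whose square is calculated from:

$$\operatorname{se}(\hat\beta_i)^2 = (\mathbf{X}^T\mathbf{X})^{-1}_{ii}\,\frac{(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})}{n - k - 1} \tag{12}$$

with $k$ being the number of independent variables, so that the model has $k+1$
$\beta$ parameters including the intercept $\beta_0$, and where $(\mathbf{X}^T\mathbf{X})^{-1}_{ii}$ denotes the
diagonal element of the square matrix $(\mathbf{X}^T\mathbf{X})^{-1}$ that corresponds to $\beta_i$. Under the
null hypothesis, this statistic has an F-distribution with 1 and $n-k-1$ degrees of
freedom.

The hypothesis test for the regression tests:

$$H_0 : \beta_1 = \cdots = \beta_k = 0 \tag{13}$$

by using the equation:

$$F^* = \frac{\left[(\mathbf{y} - \bar{\mathbf{y}})^T(\mathbf{y} - \bar{\mathbf{y}}) - (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})\right]/k}{\left[(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})\right]/(n - k - 1)} \tag{14}$$

where $\bar{\mathbf{y}}$ represents the $n \times 1$ constant vector where each element is the average
of $y_1, \ldots, y_n$. Under the null hypothesis, $F^*$ has an F-distribution with $k$ and
$n-k-1$ degrees of freedom. Using the F-distribution with the calculated F-statistic,
one can find the p-value: the probability of seeing a value in the F-distribution of
the size of the F-statistic or larger. A small p-value indicates that at least one of
the coefficients is not zero. A large p-value indicates that there is not enough
evidence from the data to show any relationship between the dependent and
independent variables. Graphically, the p-value is the area under the
F-distribution density to the right of $F^*$, as shown in Figure 4.

Figure 4. Graphical Representation of the p-value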
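
The two test statistics can be illustrated for the simplest case of one independent variable. The Python sketch below is an assumption-laden example rather than the study's code: it takes k as the number of independent variables, so the residual degrees of freedom are n - k - 1 (equivalently, n minus the number of coefficients including the intercept), and it uses synthetic data. With a single regressor, the overall F-statistic and the squared ratio of the slope to its standard error should coincide.

```python
import math


def simple_regression_f_tests(x, y):
    """Return (overall F-statistic, squared slope/se ratio) for y = b0 + b1*x."""
    n, k = len(x), 1
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * a for a in x]
    sse = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))  # residual SS
    sst = sum((obs - y_bar) ** 2 for obs in y)                  # total SS
    mse = sse / (n - k - 1)
    f_overall = ((sst - sse) / k) / mse
    # For one regressor, the slope's diagonal element of (X^T X)^{-1} is 1/Sxx.
    se_b1 = math.sqrt(mse / sxx)
    f_coef = (b1 / se_b1) ** 2
    return f_overall, f_coef


# Synthetic data (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
f_overall, f_coef = simple_regression_f_tests(x, y)
```

Converting either statistic to a p-value requires the F-distribution's tail probability, which a statistics library would normally supply.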


2. R-Squared ($R^2$)

R-Squared, often abbreviated $R^2$, which is also called the coefficient of
determination or the percentage of variance explained, is given by the following
formula:

$$R^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{SSR}{SST} \tag{15}$$

where SSR is the regression sum of squares, SST is the total sum of squares,
and $\hat{y}_i$ is the fitted or predicted value for the ith observation.

$R^2$ is the ratio of the sum of squared deviations of the predicted values from
the mean of the observed dependent values to the total sum of squares. Ideally, this
number should be as close to 1 as possible, signifying that the predicted values
for the dependent variable are very close to the actual values for the dependent
variable. An $R^2$ value near 1 indicates that most of the variability in the observed
y values is accounted for in the model.
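
As a brief illustration, the ratio of the regression sum of squares to the total sum of squares can be computed from observed and fitted values. The Python sketch below uses synthetic numbers, with fitted values taken from a least squares line through the observations; the function name is hypothetical.

```python
def r_squared(y, fitted):
    """Coefficient of determination: SSR / SST for least squares fitted values."""
    y_bar = sum(y) / len(y)
    ssr = sum((fit - y_bar) ** 2 for fit in fitted)  # regression sum of squares
    sst = sum((obs - y_bar) ** 2 for obs in y)       # total sum of squares
    return ssr / sst


# Synthetic observations and their least squares fitted values (illustrative only).
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted = [2.04, 4.03, 6.02, 8.01, 10.00]
r2 = r_squared(y, fitted)
```

The SSR/SST form is only guaranteed to lie between 0 and 1 when the fitted values come from a least squares fit with an intercept.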
3. Residual Standard Error 


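One important check on the validity of a regression model is the
Residual Standard Error (RSE) for the model. The equation for the RSE for
the model is:

$$RSE = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - k - 1}} \tag{16}$$

where $\hat{y}_i$ is the fitted value for the ith observation and $n-k-1$ is the residual
degrees of freedom for a model with $k$ independent variables and an intercept.

This equation is a method of estimating how far the fitted values are from
the actual observed values. This value is also used in cross-validation of the
model, a tool used to guard against model overfitting that is explained in another
section.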
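
A minimal Python sketch of the RSE follows, assuming the usual definition: the square root of the residual sum of squares divided by the residual degrees of freedom, where the degrees of freedom are the number of observations minus the number of estimated coefficients. The data and function name are illustrative only.

```python
import math


def residual_standard_error(y, fitted, num_regressors):
    """Square root of SSE over residual degrees of freedom (n - k - 1)."""
    df = len(y) - num_regressors - 1  # one extra coefficient for the intercept
    if df <= 0:
        raise ValueError("not enough observations for the number of regressors")
    sse = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
    return math.sqrt(sse / df)


# Synthetic observations and fitted values from a one-regressor model.
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted = [2.04, 4.03, 6.02, 8.01, 10.00]
rse = residual_standard_error(y, fitted, num_regressors=1)
```

Because the divisor shrinks as regressors are added, the RSE does not automatically improve when a variable is added, unlike the raw residual sum of squares.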


4. Analysis of Variance 


The Analysis of Variance (ANOVA) organizes the computation of the test
statistics for a sequence of hypothesis tests. The most common ANOVA tests a
sequence of hypotheses that adds coefficients to the model one at a time, in
order to test the increased significance of the model with each independent
variable added. The sequence of hypotheses tested is:

$$\begin{aligned} H_0 &: \beta_0 \\ H_1 &: \beta_0 + \beta_1 x_1 \\ &\;\;\vdots \\ H_k &: \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \end{aligned} \tag{17}$$

An F-statistic is calculated for each step of the ANOVA, with $H_j(y_i)$ being
the predicted value of the ith observation based on the model in the jth
hypothesis:

$$F_j = \frac{\sum_{i=1}^{n}\left(y_i - H_{j-1}(y_i)\right)^2 - \sum_{i=1}^{n}\left(y_i - H_j(y_i)\right)^2}{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 / (n - k - 1)} \tag{18}$$

where $\hat{y}_i = H_k(y_i)$ is the fitted value from the full model with all $k$
independent variables.
As with the p-value test, if the F-statistic is very large, implying a very
small p-value, the jth independent variable should be left in the regression. As
before, the p-value is the probability of observing an F-statistic as large as or
larger than the one computed from the data. A small p-value corresponding to the
test statistic for the test of the null hypothesis $H_{j-1}$ against the alternative
$H_j$ indicates that the jth regressor is needed in the regression equation when the
previous $j-1$ regressors are already accounted for in the model.


One thing to note about the ANOVA is that an independent variable with a
large p-value may still be left in the regression. This is because, as
a general rule of regression, lower-order terms are left in the regression if the
corresponding higher-power terms have a low p-value. For example, if a squared
term has a very low p-value but the linear term has a high p-value, the linear
term will be left in the regression.
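
The sequential tests can be sketched in Python by fitting the nested models one at a time and comparing their residual sums of squares. This is a minimal, dependency-free illustration with synthetic data and hypothetical helper names; it assumes the denominator is the residual mean square of the full model with k independent variables, so its degrees of freedom are n - k - 1.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_sse(columns, y):
    """Residual sum of squares for y regressed on an intercept plus columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
           for r in range(p)]
    xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    return sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))


def sequential_f(columns, y):
    """SSE of each nested model H_0..H_k and the F-statistic for each added term."""
    n, k = len(y), len(columns)
    sse = [fit_sse(columns[:j], y) for j in range(k + 1)]
    mse_full = sse[k] / (n - k - 1)
    f_stats = [(sse[j - 1] - sse[j]) / mse_full for j in range(1, k + 1)]
    return sse, f_stats


# Synthetic data (illustrative only).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
y = [3.1, 4.2, 7.9, 8.1, 12.2, 11.8]
sse, f_stats = sequential_f([x1, x2], y)
```

Because the models are nested, each added regressor can only lower the residual sum of squares; the F-statistic measures whether that drop is large relative to the remaining noise.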


D. GENERALIZED LINEAR MODELS 


Generalized linear models (GLMs) include the linear regression explained in
the previous section. GLMs are a “unifying approach to regression and
experimental design models, uniting the usual normal-theory linear regression
models and nonlinear model” (Montgomery, Peck & Vining, 2006, pp. 454-455)
where the dependent variable can have a distribution from a family of
distributions other than the normal, such as the Poisson, exponential, or binomial.


It is still necessary to estimate the coefficients in order to predict the
dependent variable for a GLM. However, a GLM has an additional component, called
the link function, which gives the relationship between the expected value of
the dependent variable and the linear function of the independent variables. A
GLM takes the form:

$$
g[E(y_i)] = x_i'\beta, \qquad E(y_i) = g^{-1}(x_i'\beta)
\tag{19}
$$

where the function g is called the link function and $x_i$ is the (k+1)x1 vector
containing a 1 and the values of the k independent variables for the ith
observation.


Two non-linear approaches were used in this study, Poisson regression 
and negative-binomial regression. Both of these methods will be explained in the 


following sections. 


1. Poisson Regression 


Poisson regression is based on the fact that the dependent variable may
have a count-based distribution. The formula for a Poisson distribution is

$$
f(y) = \frac{e^{-\mu}\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots
\tag{20}
$$

where $\mu > 0$ and represents the mean of y. The variance of the Poisson
distribution is also $\mu$.


For Poisson regression, the log function is used as the link function.
Therefore, for $i = 1, \ldots, n$, where each $y_i$ has a Poisson distribution
with mean $\mu_i$, the Poisson regression model is expressed as:

$$
E(y_i) = \mu_i, \qquad g(\mu_i) = x_i'\beta, \qquad \mu_i = g^{-1}(x_i'\beta)
\tag{21}
$$

Substituting the log link function gives:

$$
\ln(\mu_i) = x_i'\beta, \qquad \mu_i = e^{x_i'\beta}
\tag{22}
$$


The maximum-likelihood estimation (MLE) approach must be employed in
order to estimate the $\beta$s for the regression. Finding the MLE for the
Poisson regression starts with the expression for the likelihood of observing y
as a function of $\beta$ (where $y_1, \ldots, y_n$ are assumed independent):

$$
L(y,\beta) = \prod_{i=1}^{n} f(y_i) = \prod_{i=1}^{n} \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!}
\tag{23}
$$

$$
\begin{aligned}
\ln L(y,\beta) &= \sum_{i=1}^{n}\ln\left(e^{-\mu_i}\right) + \sum_{i=1}^{n}\ln\left(\mu_i^{y_i}\right) - \sum_{i=1}^{n}\ln(y_i!) \\
&= -\sum_{i=1}^{n}\mu_i + \sum_{i=1}^{n} y_i\ln(\mu_i) - \sum_{i=1}^{n}\ln(y_i!)
\end{aligned}
\tag{24}
$$

Denoting $\ln(L(y,\beta))$ by $l(y,\beta)$, the goal, as in ordinary least
squares, is to set the derivative with respect to $\beta$ to zero and solve the
following equation:

$$
\frac{\partial l(y,\beta)}{\partial \beta} = \sum_{i=1}^{n}\left[y_i x_i - x_i e^{x_i'\beta}\right] = \sum_{i=1}^{n} x_i\left(y_i - e^{x_i'\beta}\right) = 0
\tag{25}
$$

There is no analytical approach to solving for the coefficients in the above
equation. Therefore, at this point, some type of numerical method, such as the
Newton-Raphson technique using iteratively reweighted least squares, is used to
estimate the coefficients (Montgomery, Peck & Vining, 2006, p. 575).
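To make the numerical step concrete, the following is a minimal sketch of a Newton-Raphson iteration for Poisson regression with a log link. It is not the routine used in this study; the function name `fit_poisson`, the starting value, and the stopping tolerance are assumptions, and the design matrix is assumed to carry the column of ones in its first column.

```python
import numpy as np

def fit_poisson(X, y, iterations=50, tol=1e-10):
    """Estimate Poisson regression coefficients by Newton-Raphson.

    X is the (n, k+1) design matrix whose first column is all ones;
    y is the length-n vector of observed counts (as floats).
    """
    beta = np.zeros(X.shape[1])
    # Assumed starting point: the intercept-only fit, ln(mean of y).
    beta[0] = np.log(y.mean())
    for _ in range(iterations):
        mu = np.exp(X @ beta)              # fitted means under the log link
        score = X.T @ (y - mu)             # derivative of the log-likelihood
        info = X.T @ (X * mu[:, None])     # X' diag(mu) X, the information matrix
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Each pass solves a weighted least-squares system with weights equal to the current fitted means, which is why the method is often described as iteratively reweighted least squares.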


a) Goodness-of-Fit with the Poisson Model 


Goodness-of-fit for a Poisson model is measured using the residual
deviance instead of the R² or the residual standard error used in linear
regression. The formula for residual deviance for Poisson regression is:

$$
D = 2\sum_{i=1}^{n}\left[y_i\ln\left(\frac{y_i}{\hat{\mu}_i}\right) - \left(y_i - \hat{\mu}_i\right)\right]
\tag{26}
$$

The residual deviance should be as small as possible. For Poisson
regression, the residual deviance, ideally, will be close to or less than the
number of observations minus the number of parameters, that is, the residual
degrees of freedom of the model. If the residual deviance is much greater than
the residual degrees of freedom, the model may not be a good fit and must be
modified.
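As a small illustration of the residual deviance, the sketch below evaluates it for a list of observed counts and fitted means. The function name `poisson_deviance` is hypothetical, and the convention that an observation with a count of zero contributes no logarithmic term is the usual limiting value, stated here as an assumption.

```python
from math import log

def poisson_deviance(observed, fitted):
    """Residual deviance of a Poisson fit: 2 * sum(y ln(y/mu) - (y - mu))."""
    total = 0.0
    for y, mu in zip(observed, fitted):
        # y * ln(y / mu) tends to 0 as y tends to 0, so it is skipped when y == 0.
        log_term = y * log(y / mu) if y > 0 else 0.0
        total += log_term - (y - mu)
    return 2.0 * total
```

A perfect fit, where every fitted mean equals the observed count, gives a deviance of zero; the deviance grows as the fitted means move away from the observations.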


2. Negative Binomial Regression


A Poisson model assumes that the variance of the dependent variable
equals the mean of the dependent variable. Oftentimes, this assumption does not
hold and the dependent variable is more variable than can be accounted for by
the independent variables in the Poisson regression. To remedy this, the
negative binomial regression technique can be employed. The negative binomial
distribution is closely related to the Poisson distribution in that it also
describes a count: the number of occurrences of an event before a concluding
event is reached. The formula for the negative binomial distribution is:








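$$
f(y) = \frac{\Gamma(y+r)}{y!\,\Gamma(r)}\,p^{y}(1-p)^{r}, \qquad y = 0, 1, 2, \ldots
\tag{27}
$$

with

$$
E(y) = \frac{pr}{1-p}, \qquad \mathrm{Var}(y) = \frac{pr}{(1-p)^2}
\tag{28}
$$

where r>0 and, when it is an integer, r can be interpreted as the number of
failures, y is the number of successes observed before exactly r failures occur,
and p is the probability of a success. The function $\Gamma(x)$ is the gamma
function, which extends the factorial to non-integer arguments and so gives a
continuous version of the choose function:

$$
\Gamma(x) = \int_{0}^{\infty} t^{x-1}e^{-t}\,dt
\tag{29}
$$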


To use the negative binomial for regression and make the distribution
comparable to the Poisson regression, some adjustments to the distribution must
be made. Let $\mu = E(y)$. Then:

$$
\mu = \frac{pr}{1-p} \quad\Rightarrow\quad p = \frac{\mu}{\mu+r}, \qquad 1-p = \frac{r}{\mu+r}
\tag{30}
$$


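Substituting for p and 1-p as functions of $\mu$ and r in (27) gives:

$$
f(y) = \frac{\Gamma(y+r)}{y!\,\Gamma(r)}\left(\frac{\mu}{\mu+r}\right)^{y}\left(\frac{r}{\mu+r}\right)^{r}
= \frac{\Gamma(y+r)}{y!\,\Gamma(r)}\,\frac{\mu^{y}}{(\mu+r)^{y}}\left(1+\frac{\mu}{r}\right)^{-r}
\tag{31}
$$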




















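Further, to see the relationship between the negative binomial and
Poisson distributions, let $r \to \infty$ and $p \to 0$ with the mean $\mu$ held
fixed. This implies no chance of a success and continual counting of failures
and leads to:

$$
\lim_{r\to\infty}\frac{\Gamma(y+r)}{y!\,\Gamma(r)}\,\frac{\mu^{y}}{(\mu+r)^{y}}\left(1+\frac{\mu}{r}\right)^{-r} = \frac{\mu^{y}e^{-\mu}}{y!}
\tag{32}
$$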





which leads back to the Poisson distribution. 
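The limiting relationship can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the mean of 4 is an arbitrary choice, and the negative binomial is written in terms of its mean and r rather than p.

```python
import math

def negbin_pmf(y, mu, r):
    """Negative binomial probability written in terms of the mean mu and r."""
    log_p = (math.lgamma(y + r) - math.lgamma(y + 1) - math.lgamma(r)
             + y * math.log(mu / (mu + r)) + r * math.log(r / (mu + r)))
    return math.exp(log_p)

def poisson_pmf(y, mu):
    """Poisson probability of observing the count y with mean mu."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def largest_gap(mu, r, max_count=30):
    """Largest absolute difference between the two pmfs over 0..max_count-1."""
    return max(abs(negbin_pmf(y, mu, r) - poisson_pmf(y, mu))
               for y in range(max_count))

for r in (1, 10, 100, 10_000):
    print(f"r = {r:>6}: largest pmf difference = {largest_gap(4.0, r):.6f}")
```

The printed differences shrink as r grows, which is the behavior equation (32) describes.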


Estimating the parameters for the negative binomial regression, which
also uses the log link function, is similar to estimating the parameters for
Poisson regression:

$$
g(\mu_i) = x_i'\beta \quad\Rightarrow\quad \mu_i = g^{-1}(x_i'\beta) = e^{x_i'\beta}
\tag{33}
$$
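Now, the MLE for the negative binomial can be found by first expressing
the likelihood of observing y and, equivalently, the log-likelihood as:

$$
\begin{aligned}
L(y,\beta) &= \prod_{i=1}^{n}\frac{\Gamma(r+y_i)}{y_i!\,\Gamma(r)}\,\frac{\mu_i^{y_i}}{(\mu_i+r)^{y_i}}\left(1+\frac{\mu_i}{r}\right)^{-r} \\
l(y,\beta) &= \sum_{i=1}^{n}\left[y_i\ln(\mu_i) + \ln\Gamma(r+y_i) - \ln(y_i!) - \ln\Gamma(r) - y_i\ln(\mu_i+r) - r\ln\left(1+\frac{\mu_i}{r}\right)\right] \\
&= \sum_{i=1}^{n}\left[y_i x_i'\beta + \ln\Gamma(r+y_i) - \ln(y_i!) - \ln\Gamma(r) - y_i\ln\left(e^{x_i'\beta}+r\right) - r\ln\left(1+\frac{e^{x_i'\beta}}{r}\right)\right]
\end{aligned}
\tag{34}
$$

Now, to find the MLE $\hat{\beta}$ for $\beta$, take the derivative with respect
to $\beta$ and set it to zero:

$$
\frac{\partial l(y,\beta)}{\partial\beta} = \sum_{i=1}^{n} x_i\left[y_i - \frac{(y_i+r)\,e^{x_i'\beta}}{e^{x_i'\beta}+r}\right] = 0
\tag{35}
$$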


As with the Poisson regression, there is no closed-form solution to this 
equation, leading to the use of a numerical method to solve for the parameters 


for the regression equation. 


In this study, the negative binomial regression is a means to expand upon 
the Poisson regression. When a negative binomial regression was used, a 
Poisson model was first fit to the data. The Poisson coefficients were then used 
as the starting values for the numeric computation of the negative binomial 


regression coefficients. 
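For a fixed value of r, the derivative of the negative binomial log-likelihood with respect to the coefficients can be checked against a finite-difference approximation. The sketch below is illustrative only; the function names are hypothetical, r is treated as known rather than estimated, and a log link is assumed as in the text.

```python
import numpy as np
from math import lgamma

def negbin_loglik(beta, X, y, r):
    """Negative binomial log-likelihood with log link and fixed r."""
    mu = np.exp(X @ beta)
    # Terms that do not depend on beta.
    const = sum(lgamma(r + yi) - lgamma(yi + 1) - lgamma(r) for yi in y)
    varying = np.sum(y * np.log(mu) - y * np.log(mu + r) - r * np.log1p(mu / r))
    return float(const + varying)

def negbin_score(beta, X, y, r):
    """Derivative of negbin_loglik with respect to beta (r held fixed)."""
    mu = np.exp(X @ beta)
    return X.T @ (r * (y - mu) / (mu + r))
```

In a numerical fit, a routine such as Newton-Raphson would drive `negbin_score` to zero, starting, as described above, from the Poisson coefficients.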


E. CROSS VALIDATION


Cross validation is a method used to ensure a regression model of any
type is not overfit. An overfit model works well only for the observed data and
is less useful for predicting future outcomes. This is a danger when many
independent variables are available to predict the dependent variable and n is
small or moderate. To cross validate a regression, the observed response values
are randomly broken into m subgroups. The regression is then refit to m-1 of the
original subgroups and the values for the mth group are estimated. In the
extreme version of cross validation, called the jackknife, m=n. The regression
model is fit to n-1 observations and used to predict the one observation that is
left out. This is repeated n times. Let $\hat{y}_{(i)}$, $i = 1, \ldots, n$,
represent the predicted value of the ith observation obtained in this manner.
The residuals of these held-out predictions are calculated, giving the
cross-validation score.





This score is compared to the residual standard error (RSE) for the 
complete model described in equation (17). 


If the cross-validation score is much higher than the RSE of the fitted model,
the model is said to be overfit. An overfit model may have too many variables or
too many interactions, giving the regression the illusion of a very good fit
when, in fact, the model fits only the given observations well and does not
perform well on other data.
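The jackknife procedure can be sketched for an ordinary least squares model as follows. This is an illustration, not the study's code: the function names are hypothetical, and the exact scaling of the cross-validation score is an assumption (the held-out residuals are pooled on the same scale as the RSE so that the two numbers are directly comparable).

```python
import numpy as np

def ols_fit(X, y):
    """Least squares coefficients; X already contains the column of ones."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def jackknife_predictions(X, y):
    """Predict each observation from a model fit to the other n-1."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        preds[i] = X[i] @ ols_fit(X[keep], y[keep])
    return preds

def residual_standard_error(X, y):
    """RSE of the model fit to all n observations."""
    n, p = X.shape
    resid = y - X @ ols_fit(X, y)
    return float(np.sqrt(resid @ resid / (n - p)))

def cv_score(X, y):
    """Assumed form: held-out residuals pooled on the RSE scale."""
    n, p = X.shape
    resid = y - jackknife_predictions(X, y)
    return float(np.sqrt(resid @ resid / (n - p)))
```

With this scaling the cross-validation score of a least squares fit is never smaller than the RSE, so it is the size of the gap between the two that signals overfitting.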


For this study, the linear regression and Poisson models were both 
checked for overfitting. Because the negative binomial regressions are fit after 
each Poisson regression and use the same regressors as do the corresponding 
Poisson regression, they are not cross validated to check for overfitting. 




III. DATA ANALYSIS


The data collected and researched in this study is specific to Salinas, 
California and thus any conclusions made about the data will be specific to that 


region of California. 


A. DATA COLLECTION 


In an effort to continue the work of Clarke and Onufer, the data assessed in
this research closely resembles the variables used in their 2009 thesis. The
data has been updated with the most current statistics available for Salinas,
and all available 2010 data was added to the past data.


All of the other data was collected from online federal resources. The 
purpose of this research was to give Salinas Police Department (SPD) an easily 
accessible tool to estimate crime levels with readily accessible data. Therefore, 


all data used in this research is publicly accessible. 


It is also important to note that inflation was taken into account with the 
study, but did not significantly affect the trend of the financial data and was 


therefore not input into the calculations. 


In order to estimate Salinas violence trends, two different types of
variables are needed. Independent variables are the environmental factors
affecting violence, such as Salinas Police Department budget, unemployment
level, and prison statistics. The dependent variable is what is to be predicted,
in this case, violence levels.


A 2006 study by the Department of Justice showed that aggravated
assault, auto theft, burglary, drug sales, theft, and robbery are the most
likely criminal offenses perpetrated by youth gangs (Egley & O’Donnell, 2008).
Therefore, this study uses a summation of reported homicides, aggravated
assaults, and robberies, as reported yearly by Salinas to the Federal Bureau of
Investigation, to predict future crime trends. This data is easily obtained from
either the Salinas Police Department web page or the Federal Bureau of
Investigation web page.


B. CORRELATION OF VARIABLE ANALYSIS 


In order to accurately formulate a regression for prediction, the correlation 
between all of the variables was explored. The variables with the highest 
correlation to the dependent variable were used for regression formulation. 
Because the study is a time-based regression, there is a possibility that some of 
the variables will affect other variables one or even two years later. Therefore, 
not only was direct correlation explored, but also correlation with violence shifted 


one and two years in the future. Table 1 shows the resulting correlations. 
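The shifted correlations can be computed by pairing each variable in year t with violence in year t plus the shift. The sketch below illustrates the idea; the function names are hypothetical and the direction of the shift (the independent variable leading violence) is an assumption drawn from the description above.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

def shifted_correlation(regressor, violence, shift):
    """Correlate the regressor in year t with violence in year t + shift."""
    if shift < 0:
        raise ValueError("shift must be non-negative")
    paired_x = regressor[:len(regressor) - shift]
    paired_y = violence[shift:]
    return pearson(paired_x, paired_y)
```

Each year of shift removes one pair of observations, so with a short yearly series the shifted correlations rest on fewer data points than the unshifted one.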






























































| Variable | No Shift | One Year Shift | Two Year Shift |
|---|---|---|---|
| Population | 0.687 | 0.635 | 0.572 |
| Drop Outs | 0.310 | 0.09993 | 0.311 |
| Drop Out Rate | 0.478 | 0.283 | 0.482 |
| SPD Budget | 0.459 | 0.423 | 0.403 |
| SPD Employees | 0.272 | 0.194 | 0.133 |
| Sworn Police | -0.552 | -0.705 | -0.373 |
| CDCR Capacity | 0.791 | 0.729 | 0.655 |
| CDCR Population | 0.776 | 0.712 | 0.644 |
| CDCR Overpopulation Percentage | 0.834 | 0.804 | 0.799 |
| Parole Population | 0.815 | 0.767 | 0.713 |
| Parks and Recreation Budget | 0.242 | 0.298 | 0.352 |
| Library Budget | 0.572 | 0.550 | 0.549 |
| Unemployment Percentage | 0.516 | 0.701 | 0.776 |
| Number of Vacant Units | -0.526 | -0.631 | -0.724 |
| Persons Per Household | -0.242 | -0.399 | -0.651 |

Table 1. Correlation of Independent Variables and Violence


Examining the correlation led to the use of population, SPD Budget, sworn
police with a one year shift, CDCR Overpopulation Percentage, parole
population, unemployment percentage with a two year shift, number of vacant
units with a two year shift and persons per household with a two year shift.
With the choice of variables, it was necessary to examine the correlation
between the variables. If two variables are highly correlated, only one of the
variables will assist in the regression. The correlation between variables is
described in Table 2.





| | P | S | O | Pa | SP | U | N | H |
|---|---|---|---|---|---|---|---|---|
| Population (P) | 1.000 | 0.891 | 0.866 | 0.948 | 0.66 | -0.80 | 0.92 | 0.963 |
| SPD Budget (S) | 0.891 | 1.000 | 0.684 | 0.769 | 0.93 | -0.83 | 0.96 | 0.714 |
| CDCR Overpopulation Percentage (O) | 0.866 | 0.684 | 1.000 | 0.909 | -0.07 | -0.41 | 0.64 | 0.832 |
| Parole Population (Pa) | 0.948 | 0.769 | 0.909 | 1.000 | -0.10 | -0.61 | 0.69 | 0.850 |
| Sworn Police (SP) | 0.659 | 0.934 | -0.065 | -0.10 | 1.000 | -0.665 | 0.86 | 0.350 |
| Unemployment Percentage (U) | -0.80 | -0.827 | -0.408 | -0.61 | -0.67 | 1.000 | -0.89 | -0.70 |
| Number of Vacant Units (N) | 0.922 | 0.963 | 0.643 | 0.686 | 0.864 | -0.89 | 1.000 | 0.844 |
| Persons Per Household (H) | 0.963 | 0.714 | 0.832 | 0.850 | 0.350 | -0.70 | 0.844 | 1.000 |

Table 2. Correlation between Independent Variables for OLS Regression to
Predict Violence



























































Examining the correlation between pairs of independent variables led to the
removal of population and parole population from the regression formulation.
Those two variables were highly correlated with other variables and were
accounted for by the other variables.


A graphical depiction of the correlation for the chosen variables is shown in
Figure 5.

[Figure 5 panel labels: 0.776 Correlation, -0.724 Correlation, -0.651 Correlation]

Figure 5. Graphical Representation of Correlation between Independent
Variables and Violence. Violence in Red

In each of the panels in Figure 5, violence (red) is plotted against time.
Each of the blue lines represents an independent variable (as labeled in each
panel) plotted against time.


Most of the variables for the study, be it budgets or manpower, increased 
over time. This trend similarity indicates that many of the variables will not assist 
in a regression because of the similarity of correlation between independent 


variables. 


It is also of interest to the City of Salinas to predict future homicide rates
using the economic variables. Therefore, the correlation between the variables
and Salinas homicide rates with shifts in years was also calculated and is
displayed in Table 3.

| Variable | No Shift | One Year Shift | Two Years Shift |
|---|---|---|---|
| Population | 0.610 | 0.593 | 0.582 |
| Drop Outs | 0.065 | 0.209 | 0.154 |
| Drop Out Rate | -0.052 | 0.171 | 0.213 |
| SPD Budget | 0.586 | 0.618 | 0.618 |
| SPD Employees | 0.502 | 0.561 | 0.556 |
| Sworn Police | 0.103 | 0.250 | 0.215 |
| CDCR Capacity | 0.614 | 0.625 | 0.641 |
| CDCR Population | 0.596 | 0.613 | 0.625 |
| CDCR Overpopulation Percentage | 0.470 | 0.504 | 0.523 |
| Parole Population | 0.588 | 0.651 | 0.671 |
| Parks and Recreation Budget | 0.557 | 0.541 | 0.333 |
| Library Budget | 0.773 | 0.568 | 0.329 |
| Unemployment Percentage | 0.248 | 0.002 | -0.259 |
| Number of Vacant Units | 0.262 | 0.219 | 0.169 |
| Persons Per Household | 0.340 | 0.161 | -0.076 |

Table 3. Correlation between Independent Variables and Homicide Events in
Salinas


An inspection of the correlation led to the use of population, SPD Budget,
CDCR Overpopulation Percentage, parks and recreation budget, and library
budget. None of the variables were shifted because there was not enough of a
correlation disparity to warrant a shift. The correlation between these
variables was also explored and is displayed in Table 4.






































| | P | S | O | R | L |
|---|---|---|---|---|---|
| Population (P) | 1.000 | 0.898 | 0.879 | 0.638 | 0.841 |
| SPD Budget (S) | 0.898 | 1.000 | 0.700 | 0.745 | 0.746 |
| CDCR Overpopulation Percentage (O) | 0.879 | 0.700 | 1.000 | 0.537 | 0.734 |
| Parks and Recreation Budget (R) | 0.638 | 0.745 | 0.537 | 1.000 | 0.751 |
| Library Budget (L) | 0.841 | 0.746 | 0.734 | 0.751 | 1.000 |

Table 4. Correlation between Independent Variables for Homicide Regression


All of the independent variables of the homicide regression are highly
correlated, but all variables were kept for the initial exploration of
regression for homicide prediction. A graphical representation of the
correlation between the variables and homicide is displayed in Figure 6.

Figure 6. Graphical Correlation between Homicide and Independent Variables.
Homicides in Red


Assaults, as used in this study and as reported to the FBI, are aggravated
assaults and consist of assaults with a weapon involved. These crimes could have
easily escalated into homicides. Because of this, regression analysis was also
used to predict the combined number of assaults and homicides in Salinas.


The correlations of the independent variables against homicide and
assault numbers with one and two-year shifts are in Table 5.

| Variable | No Shift | One Year Shift | Two Year Shift |
|---|---|---|---|
| Population | 0.589 | 0.520 | 0.436 |
| Drop Outs | 0.362 | 0.276 | 0.152 |
| Drop Out Rate | 0.498 | 0.441 | 0.343 |
| SPD Budget | 0.370 | 0.325 | 0.288 |
| SPD Employees | 0.156 | 0.063 | -0.028 |
| Sworn Police | -0.479 | -0.521 | -0.194 |
| CDCR Capacity | 0.693 | 0.610 | 0.507 |
| CDCR Population | 0.684 | 0.596 | 0.501 |
| CDCR Overpopulation Percentage | 0.827 | 0.766 | 0.729 |
| Parole Population | 0.728 | 0.649 | 0.564 |
| Parks and Recreation Budget | 0.245 | 0.257 | 0.309 |
| Library Budget | 0.493 | 0.435 | 0.399 |
| Unemployment Percentage | 0.575 | 0.690 | 0.773 |
| Number of Vacant Units | -0.672 | -0.676 | -0.685 |
| Persons Per Household | -0.486 | -0.522 | -0.602 |

Table 5. Correlation of Variables against Assaults and Homicides





Examination of the correlation of assaults and homicides against the
possible regressors led to the use of CDCR Overpopulation Percentage, parole
population, unemployment percentage with a two year shift, number of vacant
units, and persons per household with a two year shift. The correlation between
these possible variables is displayed in Table 6.


| | O | Pa | U | N | H |
|---|---|---|---|---|---|
| CDCR Overpopulation Percentage (O) | 1.000 | 0.909 | -0.408 | 0.770 | 0.832 |
| Parole Population (Pa) | 0.909 | 1.000 | -0.605 | 0.805 | 0.850 |
| Unemployment Percentage (U) | -0.408 | -0.605 | 1.000 | -0.859 | -0.701 |
| Number of Vacant Units (N) | 0.770 | 0.805 | -0.859 | 1.000 | 0.909 |
| Persons Per Household (H) | 0.832 | 0.850 | -0.701 | 0.909 | 1.000 |

Table 6. Correlation between Independent Variables for Homicide and
Assault Regression


After examining the inter-correlation between the possible regressor
variables, it was decided not to use persons per household or number of
vacant units in the formulation of the regression to predict assaults and
homicides. The correlation between the chosen regressors and assaults and
homicides is graphed in Figure 7.

Figure 7. Graphical Correlation between Homicide and Assaults and
Independent Variables. Homicides and Assaults in Red


After all variables were chosen, all three of the dependent variables, 
violence, homicides, and assaults and homicides, were input into linear, Poisson, 
and negative binomial models in an attempt to predict possible future criminal 


activity levels in Salinas. 


C. REGRESSION ANALYSIS 


In order to estimate violence levels for Salinas, California, the regression
techniques outlined in Chapter II were applied to the data used in the Clarke
and Onufer (2009) thesis.


1. Violence Prediction using Ordinary Least Squares 


The initial regression for all three variables was ordinary least squares 
(OLS), or general linear regression. The method used to find the optimal 
regression was backward elimination, wherein all of the variables of interest 
identified in the previous section were included in the regression and removed if 


the variable did not add to the quality of the regression. 


An initial regression included persons per household, but that variable was
found to add nothing to the model and was therefore removed. The second
regression equation for violence prediction with OLS regression was:

y = -1.827x10° + 9.384x10 °(SPD Budget)
    - 8.354x10^2(CDCR Overpopulation Percentage)
    + 50.62(Unemployment with Two Year Shift)
    + 1.712(Number of Vacant Units with Two Year Shift)
    - 13.93(SPD Sworn Police with One Year Shift)                (37)
This initial regression seems to be a decent fit for the violence levels. The
p-values for the coefficients are given in Table 7.

| | Estimates | Standard Error | P-value |
|---|---|---|---|
| Intercept | 1.83x10° | 9.46 x10° | 0.09465 |
| SPD Budget | 9.38x10° | 5.87 x10° | 0.154 |
| CDCR Over Population Percentage | -8.35 x10° | 4.01 x10 | 0.076 |
| Unemployment | -1.39 x10 | 2.75 | 0.0381 |
| Number Vacant Units | 5.06 x10 | 1.99 x10 | 0.125 |
| SPD Badged Police | 1.71 | 9.82 x10" | 0.00147 |

Table 7. P-Values for the initial OLS regression


This regression yielded an R² value of 0.8449, which is also consistent with a
good OLS fit for violence prediction.

With five variables included in the model, there is concern that the
regression may be overfit. Therefore, cross-validation was used on this model
and will be used on all subsequent models to check for overfitting. The RSE for
this model was 34.8 and the cross-validation score was 42.59. The model may be
slightly overfit, but is still within acceptable limits and will be used to
predict future violence levels in Salinas.


Although a good fit for the data was quickly derived from the initial values,
there was interest in attempting to fit another OLS regression with other
variables of interest and a more complete data set. The regression in equation
(37) was limited by the observations of SPD Badged Police, which were only
recorded from 1997-2010.


This new regression fit consisted of SPD budget and CDCR
Overpopulation percentage. The formula for this regression is:

y = -4.259x10^2 - 6.239x10°(SPD Budget)
    + 9.036x10^2(CDCR Overpopulation Percentage)                (38)

with the following p-values displayed in Table 8.

| | Estimates | Standard Error | P-value |
|---|---|---|---|
| Intercept | -3.34x107 | 1.73x107 | 0.0635 |
| SPD Budget | -3.23x10° | 3.14x10° | 0.0858 |
| CDCR Over Population Percentage | 8.22x10* | 1.18x107 | 6.21x10° |

Table 8. Second Fit OLS Regression for Violence P-Values


This regression has an R² of 0.7285. Although the R² for this regression is
not as high, this model emphasizes different variables, which may be useful to
the City of Salinas. The RSE for this model was 143.5 with a cross-validation
score of 149.4174. These two scores are very close together, so this model is
not overfit.


A graphical depiction of the fit of the two different models is shown in
Figures 8 and 9.

[Plot: Regression Fit 1. Violence and regression fit by year; vertical axis from 0 to 1400.]

Figure 8. Regression Fit for Formula (37)

Figure 9. Regression Fit for Formula (38)


After inspecting the regression fit through the actual data and examining the
RSE for each of the models, it was determined that formula (37) was the best fit
for the OLS prediction of violence levels in Salinas. A Poisson model was then
formulated to see if Poisson regression would be a better means of predicting
violence.


2. Violence Prediction using Poisson Regression 


As stated by Osgood (2000), Poisson regression is often useful for 
exploring crime rates. Therefore, in this study, all response variables were 


analyzed with Poisson regression, in addition to OLS regression. 


The starting point for the Poisson regression was similar to that of the OLS
regression. The initial Poisson model used the same regressors as the OLS model:

y = exp( 7.679 - 8.695x10 °(SPD Budget)
         - 1.287x10 *(SPD Sworn Police)
         - 7.743x10 ‘(CDCR Overpopulation Percentage)
         + 4.666x10 *(Unemployment)
         + 1.582x10°(Number of Vacant Units) )                  (39)

with the p-values for the regressors displayed in Table 9.

| | Estimate | Standard Error | P-value |
|---|---|---|---|
| Intercept | 7.68 | 8.30x10" | 2.00x10°% |
| SPD Budget | 8.70x10" | 5.14x10° | 0.091 |
| CDCR Over Population Percentage | -7.74x10" | 3.51x10" | 0.0276 |
| Number Vacant Units | -1.29x107 | 2.44x10° | 0.0637 |
| Unemployment | 4.67x10° | 1.73x107 | 0.00701 |
| SPD Badged Police | 1.58x10° | 8.5x107 | 1.28x107 |

Table 9. Initial Poisson Regression for Violence


Ideally, for a Poisson regression, the residual deviance should be close to
the degrees of freedom. For this regression, the deviance was 7.48 for 7 degrees
of freedom. This shows not only that the Poisson regression is a good fit,
but also that the data is not overdispersed.
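A quick way to read these numbers is to divide the residual deviance by the residual degrees of freedom; values near 1 are consistent with the Poisson assumption, while values well above 1 point toward overdispersion. The helper name `dispersion_ratio` is hypothetical; the inputs are the deviance and degrees of freedom quoted in this chapter for the violence model and for the single-regressor homicide Poisson model.

```python
def dispersion_ratio(residual_deviance, residual_df):
    """Residual deviance per residual degree of freedom."""
    return residual_deviance / residual_df

# Violence Poisson model: deviance 7.48 on 7 degrees of freedom.
print(round(dispersion_ratio(7.48, 7), 3))      # 1.069
# Homicide Poisson model: deviance 41.196 on 28 degrees of freedom.
print(round(dispersion_ratio(41.196, 28), 3))   # 1.471
```

The ratio is only a rough screen, not a formal test, which is why a negative binomial fit is still worth comparing when the ratio is noticeably above 1.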


This regression was also tested for overfitting. The cross-validation score
for the data was 1646.29 and the RSE for the model was 1217.46. These
numbers suggest some overfitting, but they are within acceptable limits. A
graphical representation of this fit is shown in Figure 10.

[Plot: Poisson Regression Fit]

Figure 10. Poisson Regression fit for Violence Prediction

The Poisson regression seems to predict violence well and there is no
evidence of overdispersion. However, for completeness the negative binomial
was also explored.


3. Violence Prediction using Negative Binomial Regression


Osgood states that Poisson regression is valid for crime rate exploration,
but that negative binomial regression can also be used and may be more
efficient, because the negative binomial can reduce the error caused by
overdispersion, for which a Poisson model cannot compensate. The initial model
for the negative binomial regression was identical to the final model for the
Poisson regression. The coefficients from the Poisson regression were also used
as the initial guess for the numerical method to determine the coefficients for
the negative binomial model. The resulting model was exactly the same as the
Poisson model. This indicates that the negative binomial regression is not an
improvement over the Poisson model, and it was not used to predict violence in
this study.


4. Homicide Prediction using Ordinary Least Squares 


With homicide levels almost five times the national average, the 
leadership in Salinas is constantly finding ways to decrease the violence levels in 
their city. To do this, it would be useful to see the factors that influence homicide 
levels in Salinas. Clarke and Onufer did this in their 2009 thesis, showing the 
economic factors correlated with violence. Taking this a step further, the city can 
predict future homicide levels and estimate the change on homicide levels by 


focusing on different economic variables. 


The initial regression method to predict homicides is the same as for violence
levels. However, different regressors were more highly correlated with homicides
than with violence. The initial OLS model was:

y = 1.152 - 7.363x10-° (Population)
    + 1.805x10-'(SPD Budget)
    - 3.959(CDCR Overpopulation Percentage)
    - 1.745x10° (Parks and Rec Funding)
    + 8.184x10 °° (Library Funding)                             (40)

This equation was not a very good fit to predict homicide rates in Salinas, 
as shown by the p-values for the coefficients in the regression. Only the p-value 
for library funding is small enough for a good regression fit. The p-values for the 
regression are given in Table 10. 























| | Estimate | Standard Error | P-value |
|---|---|---|---|
| Intercept | 1.15x10 | 9.78 | 0.2505 |
| Population | -7.36x10° | 1.57x107 | 0.64332 |
| SPD Budget | 1.81x10" | 2.46x10” | 0.47005 |
| CDCR Over Population Percentage | -3.96 | 6.85 | 0.56862 |
| Parks and Rec Fund | -1.75x10° | 2.42x10° | 0.47756 |
| Library Fund | 8.18 x10° | 2.33x10° | 0.00177 |

Table 10. P-values for Initial OLS Model to Predict Homicide Rates


Although the regression yielded an R² of 0.6288, some of the variables
are not needed in the presence of the other variables in the regression.
Exploring the model in greater detail and eliminating unnecessary variables
shows that the only independent variable necessary to predict future homicide
rates based on the trends of past homicide rates is, surprisingly, the Salinas
library funding. The model derived from this is:

y = -0.5852 + 6.130x10 ° (Library Funding)                      (41)

Although the R² is slightly reduced from the previous model, at 0.5969, it is
still high enough to show an adequate fit. R² will increase with more variables,
whether or not the variables are necessary. The p-value for library funding in
the model was 5.67x10’. This model is not overfit. The RSE for the model was
4.21 and the cross-validation score was 4.33. A graphical depiction of the fit
of this model to the homicide levels is displayed in Figure 11.

Figure 11. Graphical Depiction of OLS Regression to Predict Homicide Rates in
Salinas


The OLS fit for homicide is good, but homicides could easily have a
count-based distribution, so an exploration of a Poisson fit for homicide
prediction is advisable and is covered in the next section.


5. Homicide Prediction using Poisson Regression 


The Poisson model derived for homicide prediction started with the same 
variables the OLS model started with. The initial Poisson model was: 


y= exp(2.040-2.433x10 °(Population) 

+4.777x10 ‘(SPD Budget) 

-0.1496(CDCR Overpopulation Percentage) (42) 
-1.437x10°’ (Parks and Rec Funding) 

+ 5.718x10 (Library Funding)) 


Much like the OLS model, this first Poisson regression to predict
homicide levels was not a very good fit. The p-values for the regression are
given in Table 11.

| | Estimate | Standard Error | P-value |
|---|---|---|---|
| Intercept | 2.04 | 6.22x10" | 0.001048 |
| Population | -2.43x10° | 9.15x10* | 0.790286 |
| SPD Budget | 4.78x10° | 1.33x10° | 0.720189 |
| CDCR Over Population Percentage | -1.50x10" | 4.74x10" | 0.752115 |
| Parks and Rec Fund | -1.44x10" | 1.54x107 | 0.349927 |
| Library Fund | 5.72x10" | 1.56x10" | 0.000235 |

Table 11. P-values for Poisson Regression for Homicide Levels


The Poisson model, just like the OLS model, reduced to a regression with
Salinas library funding as the single regressor. The model derived was:

y = exp(1.522 + 4.417x10~ (Library Funding))                    (43)

The regression was checked for overfitting, but was found not to be overfit,
with a cross-validation score of 17.557 and a model RSE of 16.96. The graphical
interpretation of the model is shown in Figure 12.

Figure 12. Poisson Regression fit for Homicide





The Poisson regression fit was an acceptable fit for the data, but there is
the possibility of overdispersion. The residual deviance for the model is 41.196
with 28 degrees of freedom. Therefore, the negative binomial regression was
explored as an alternative to the Poisson regression.

6. Homicide Prediction using Negative Binomial Regression

The starting point for the negative binomial regression for homicide
prediction was the fit for the Poisson regression. This gave an equation of:

y = exp(1.520 + 4.425x10 (Library Funding))                     (44)

This model is similar to the Poisson model, but with a lower residual deviance.
The residual deviance of the negative binomial model was 34.36, down from 41.196
for the Poisson model, which suggests a better fit than the Poisson model for
predicting homicide levels. The graphical fit is displayed in Figure 13.

Figure 13. Negative Binomial Fit for Homicide


7. Assault and Homicide Prediction using Ordinary Least 
Squares 


The initial fit to predict assaults and homicides was:

$$
y = 548.549 - 115.333\,(\text{CDCR Overpopulation Percentage}) - 0.002447\,(\text{Parole Population}) + 27.882\,(\text{Unemployment with two year shift})
\tag{45}
$$

with p-values of:

| | Estimate | Standard Error | P-value |
|---|---|---|---|
| Intercept | 5.49x10^2 | 4.31x10^2 | 0.22266 |
| CDCR Over Population Percentage | 1.15x10° | 2.98x10° | 0.70399 |
| Parolee Population | -2.45x10° | 2.27x10° | 0.29801 |
| Unemployment with 2 year shift | 2.79x10 | 8.85 | 0.00662 |

Table 12. P-values for OLS Regression for Assault and Homicide prediction


The R² for the initial regression fit was 0.635. Although this seems to be a good fit, the regression to predict assault and homicide levels was dominated by unemployment with a 2-year shift, much as the homicide regression was dominated by library funding. However, unlike the homicide regression, this equation benefited from the addition of a squared unemployment term. The ideal OLS regression equation was:

$$Y = 1330.385 - 155.163\,(\text{Unemployment with two year shift}) + 9.593\,(\text{Unemployment with two year shift})^{2} \tag{46}$$

This regression was not overfit, having a cross-validation score of 56.82 and an RSE for the model of 52.31. This equation resulted in an R² of 0.7117 and a graphical interpretation shown in Figure 14.

Figure 14. OLS Regression for Assaults and Homicides

This was a good fit for the prediction. Poisson and negative binomial models were also explored.
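To see what equation 46 implies, it can be evaluated at the 2009 unemployment rate of 11.8 percent from Appendix A, which would be the two-year-shifted input for a 2011 prediction (an assumption about the input used), and its turning point can be located by setting the derivative to zero. Here U denotes unemployment with the two-year shift:

```latex
\begin{aligned}
Y(11.8) &= 1330.385 - 155.163\,(11.8) + 9.593\,(11.8)^{2} \approx 835.2, \\
\frac{dY}{dU} &= -155.163 + 2\,(9.593)\,U = 0
\;\Longrightarrow\; U = \frac{155.163}{19.186} \approx 8.1 .
\end{aligned}
```

The first value agrees with the OLS entry in Table 16. The second suggests that the fitted parabola bottoms out near 8 percent unemployment and rises on either side, so the curve is best read only within the range of unemployment rates observed in the data.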


8. Assault and Homicide Prediction using Poisson Regression 


The initial fit for the Poisson regression to predict assaults and homicides was:

$$\begin{aligned}
y = \exp\big( & 6.324 + 1.690\times10^{-1}\,(\text{CDCR Overpopulation Percentage}) \\
& - 3.226\times10^{-6}\,(\text{Parole Population}) \\
& + 3.622\times10^{-2}\,(\text{Unemployment with two year shift}) \big)
\end{aligned} \tag{47}$$

with p-values of:

Variable | Estimate | Standard Error | P-value
Intercept | 6.32 | 2.55×10^-1 | 2.00×10^-16
CDCR Over Population Percentage | 1.69×10^-1 | 1.78×10^-1 | 0.3413
Parolee Population | -3.23×10^-6 | 1.36×10^-6 | 0.0174
Unemployment with 2 year shift | 3.62×10^-2 | 5.25×10^-3 | 5.42×10^-12

Table 13. P-values for Poisson Regression for Assault and Homicide prediction


This model suggests that CDCR Overpopulation does not add to the regression, so this variable was taken out. A squared term of unemployment was added to the model to give the following equation for the optimal Poisson regression fit:

$$\begin{aligned}
y = \exp\big( & 7.470 - 1.791\times10^{-6}\,(\text{Parole Population}) \\
& - 1.730\times10^{-1}\,(\text{Unemployment with two year shift}) \\
& + 1.063\times10^{-2}\,(\text{Unemployment with two year shift})^{2} \big)
\end{aligned} \tag{48}$$

The model is not overfit, with a cross-validation score of 3056.97 and a model RSE of 2639.76. A graphical representation of the model is given by Figure 15.

Figure 15. Poisson Fit for Assaults and Homicides
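Equation 48 can be turned into a small R function to see how the two inputs combine. This is only a sketch: the function name is hypothetical, and the example call assumes that the 2011 prediction uses the 2011 parole population and the 2009 unemployment rate from Appendix A.

```r
# Hypothetical helper: fitted Poisson mean from equation 48.
# unemployment is the rate from two years earlier (the two-year shift).
poisson_assault_homicide <- function(parole_pop, unemployment) {
  exp(7.470 - 1.791e-6 * parole_pop - 1.730e-1 * unemployment +
        1.063e-2 * unemployment^2)
}

# Assumed 2011 inputs from Appendix A: 2011 parole population, 2009 unemployment
pred_2011 <- poisson_assault_homicide(100490, 11.8)
print(round(pred_2011))
```

With the rounded coefficients the result is about 836 assaults and homicides, close to the Poisson entry in Table 16.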


The Poisson model has a residual deviance of 52.856 with 15 degrees of freedom. There is strong evidence of overdispersion, which will be remedied with the negative binomial regression.
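A common rule of thumb (not stated in the thesis, so treat it as an assumption) compares the residual deviance with its degrees of freedom; for a well-specified Poisson model the ratio should be close to 1. Applied to the figures reported here, with the homicide Poisson model (41.196 on 28 degrees of freedom) for comparison:

```latex
\hat{\phi}_{\text{assaults and homicides}} \approx \frac{52.856}{15} \approx 3.52,
\qquad
\hat{\phi}_{\text{homicides}} \approx \frac{41.196}{28} \approx 1.47
```

The assault and homicide model departs from 1 far more than the homicide model does, which is consistent with the stronger language used for it.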


9. Assault and Homicide Prediction using Negative Binomial Regression

The initial fit, using the Poisson regression as a basis for the negative binomial regression, was:

$$\begin{aligned}
y = \exp\big( & 7.448 - 1.771\times10^{-6}\,(\text{Parole Population}) \\
& - 1.687\times10^{-1}\,(\text{Unemployment with two year shift}) \\
& + 1.04\times10^{-2}\,(\text{Unemployment with two year shift})^{2} \big)
\end{aligned} \tag{49}$$

This model also seems very close to the Poisson model but reduces the residual deviance from 52.856 to 19.16, implying a much better fit for the data using the negative binomial over the Poisson regression. The visual for the fit is depicted in Figure 16.

Figure 16. Negative Binomial Fit for Assault and Homicides


D. PREDICTION RESULTS USING REGRESSION MODELS 


To test the validity of the models, 2011 data was gathered or, if the data was not yet published, estimated with the best available knowledge and past trends. Both 2011 and 2012 were predicted; the 2012 predictions use models refit with the 2011 data and are detailed in Appendix B. Although the models will predict as many years into the future as data is input, it is unwise to predict too far ahead with a regression model, as policy or environmental factors could change at any time. None of these models should be used to estimate more than one or two years of violence levels in Salinas.


Violence prediction for 2011 is detailed in Table 14.

Model | Salinas Violence
OLS Regression Model | 1810.954
Poisson Regression Model | 2117.753
Negative Binomial Regression Model | 2117.753

Table 14. 2011 Violence Prediction based on the Derived Models

Homicide prediction is displayed in Table 15.

Model | Salinas Homicides
OLS Regression Model | 22.64106
Poisson Regression Model | 24.43126
Negative Binomial Regression Model | 24.45902

Table 15. 2011 Homicide Prediction based on the Derived Models


Finally, assault and homicide predictions for 2011 are in Table 16.

Model | Salinas Assaults and Homicides
OLS Regression Model | 835.2094
Poisson Regression Model | 836.4053
Negative Binomial Regression Model | 836.1911

Table 16. 2011 Homicide and Assault Prediction based on the Derived Models


The actual counts for 2011 are detailed in Table 17.

Category | Observed 2011 Count
Violence | 1083
Homicides | 15
Assaults and Homicides | 709

Table 17. Observed 2011 Crime Statistics

The predicted numbers are higher than the actual numbers for 2011 crime statistics in Salinas.
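The size of the over-prediction can be made concrete by expressing each prediction as a percentage error against the observed count. The following R sketch uses only the values printed in Tables 14 through 17; the object names are illustrative.

```r
# Predictions from Tables 14-16 (rows) for each model type (columns)
predicted <- rbind(
  Violence = c(OLS = 1810.954, Poisson = 2117.753, NegBin = 2117.753),
  Homicides = c(OLS = 22.64106, Poisson = 24.43126, NegBin = 24.45902),
  AssaultsAndHomicides = c(OLS = 835.2094, Poisson = 836.4053, NegBin = 836.1911)
)

# Observed 2011 counts from Table 17, in the same row order
observed <- c(Violence = 1083, Homicides = 15, AssaultsAndHomicides = 709)

# Relative over-prediction in percent; the vector recycles down each column,
# so every row is compared with its own observed count
pct_error <- 100 * (predicted - observed) / observed
print(round(pct_error, 1))
```

On these figures the violence models overshoot by roughly 67 to 96 percent, the homicide models by about 51 to 63 percent, and the assault and homicide models by about 18 percent.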
IV. CONCLUSION AND FUTURE WORK

All three of the model types had similar predictions, with the exception being the models for violence, where the different types of regression yielded varying results. The linear model for violence gave a closer prediction to the 2011 crime rates than the other two models. The Poisson and negative binomial models assume that the data follow a Poisson or negative binomial count distribution, respectively, and, should the data not follow such a distribution, OLS regression will be as good, if not better, at prediction. This being the case, this study found that ordinary least squares models are adequate to predict crime trends in Salinas for all three of the explored dependent variables.


Several economic variables are highly correlated with crime statistics in 
Salinas. These variables could be used to predict future crime rates in Salinas 
based on past trends and observations in the city. However, these numbers do 
not take into account policy changes enacted in the city, such as Operation 
Ceasefire. These operations are difficult to numerically quantify in a study, and 
can very well be responsible for the reduction in crime levels in Salinas. 


According to all of the models derived in this study, crime in Salinas should be on the rise in all categories. Salinas saw an increase in crime statistics from 2008 to 2010, but a reduction in crime in 2011. The most obvious conclusion to draw from the disparity between the statistical models and the actual crime levels is that Salinas is moving in the correct direction for crime prevention and gang reduction.

These results lend heavy credence to continuing the current crime prevention methods in Salinas, including the gang task force, Operation Ceasefire, coordination with CASP, and any other methods of crime reduction.


Although the 2011 predictions were incorrect, the 2012 calculations seem quite feasible and are listed in Appendix B. These numbers were derived from models that took into account 2011 economic and violence levels. Including the 2011 violence levels and environmental data may have aided the future predictions by capturing the efforts of the crime prevention tactics employed by Salinas.


A. FUTURE WORK 


Future work in this area would apply the models derived in this thesis to different communities with smaller and larger populations to see if the models predict violence in communities other than Salinas. All of the variables in the study are present in any community, with the difference being the large gang presence in Salinas. It could be a significant study to see if the level of environmental variables affects crime levels, no matter the type of population.

Another topic for future work is further exploration of the prison overpopulation problem compared to crime rates in California. As of the date of this thesis, prison overcrowding is a very important governmental topic, and California’s prison population and crime rates could be compared to those of neighboring states to see if there are correlations between crime rates and prison populations.


B. RECOMMENDATIONS 


Salinas, California should maintain its current level of diligence in crime deterrence. There is some variable not taken into account in this study that must account for the drop in crime rates from 2010 to 2011 in Salinas. Currently, Salinas’s budget predictions for 2013 show a considerable reduction to the police force, down to 88 patrol police. This will decrease the police presence in Salinas from 157 to about 143 police officers. The population of Salinas does not show any sign of decreasing. This reduction in officers will equate to a ratio of around 1 officer per 1050 people in Salinas. For a city of Salinas’s size, the Bureau of Justice Statistics estimates the average to be 1.9 officers per 1000 residents (Reaves, 2007, p. 9).
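The officer-to-resident ratio quoted above can be checked with a short calculation. It assumes the 2011 population of 150,441 listed in Appendix A and the projected 143 officers:

```latex
\frac{150{,}441\ \text{residents}}{143\ \text{officers}} \approx 1052\ \text{residents per officer},
\qquad
\frac{143}{150{,}441} \times 1000 \approx 0.95\ \text{officers per 1000 residents}
```

Under these assumptions the projected staffing is about half of the 1.9 officers per 1000 residents cited from the Bureau of Justice Statistics.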
A reduction in police force can have a devastating effect on crime rates in an already crime-ridden community. In September of 2011, Governor Chris Christie of New Jersey balanced the New Jersey budget. One consequence of the balancing was a reduction of 103 officers from the Trenton, NJ police force. As a result, Trenton PD has seen a drastic increase in crime rates, having almost a shooting a day, up from one a week, as reported by Sergeant Mark Kieffer, a 16-year veteran of Trenton PD (Glass, 2012). Although the proposed cuts in Salinas are not as drastic, the repercussions could be as dire as in Trenton.


No single variable in the study should be concentrated on as the fix to the 
crime problem. It may seem preposterous to think that increasing the Salinas 
library budget will increase the number of homicides and this could well be a 
case of correlation having little or no link to causation. However, the statistically 
observed correlation of homicides and library funding could also be an artifact of 
perceptions by the citizens of Salinas concerning city funding policies. The 
leadership of Salinas would be wise to consider possible second or third order 
effects during budget negotiations. 


Many times in the study, the overpopulation of the California prison system was a variable of great concern. Overcrowding in prisons means early parole for prisoners. The parolees, when released, go back to their previous residence, which is also the place where they committed the crime that led to their prison sentence in the first place. The unemployment level in Salinas is currently around 15% and the recidivism rate for California is around 70%. These factors point towards a continual high level of crime in Salinas.


Finally, all findings from this study will be given to the Salinas leadership in an effort to help address Salinas’s crime problem in any way possible. Although the predictions are just that, predictions, these tools can be used to help guide the administration and the budgeting department for Salinas into the future.

APPENDIX A. DATA

All data in the study was taken from government sources, when available, and from the past research of Clarke and Onufer when the government sources did not have the data. All of the links to data sources are listed below the tables.

Year Population Homicides Robbery Assault Violence 
1980 80479 9 208 283 500 
1981 82700 9 191 344 544 
1982 85300 11 179 356 546 
1983 87600 2 200 372 574 
1984 91100 8 159 362 529 
1985 94600 10 167 424 601 
1986 98300 9 204 672 885 
1987 100800 7 192 633 832 
1988 103900 4 217 722 943 
1989 105400 7 217 734 958 
1990 108777 11 262 778 1051 
1991 111184 7 253 805 1065 
1992 114736 17 388 722 1127 
1993 116686 15 560 844 1419 
1994 120885 24 414 846 1284 
1995 121960 15 494 950 1459 
1996 124972 8 412 884 1304 
1997 127369 18 348 895 1261 
1998 132449 17 440 661 1118 
1999 136797 13 346 737 1096 
2000 142685 18 443 734 1195 
2001 144728 15 399 799 1213 
2002 146659 20 367 692 1079 
2003 148117 19 399 725 1143 
2004 149838 17 452 678 1147 
2005 149626 7 335 645 987 
2006 148707 7 383 683 1073 
2007 148782 14 378 711 1103 
2008 150898 25 334 633 992 
2009 150215 29 359 678 1066
2010 150441 19 365 755 1139
2011 150441 15 374 694 1083

Year | Drop Outs | Drop Out Rate | SPD Budget | SPD Employees | Sworn Police
1980 NA NA 5342123 NA NA
1981 NA NA 5726778 NA NA
1982 NA NA 6170072 173 NA
1983 NA NA 6645438 177 NA
1984 NA NA 7715577 180.5 NA
1985 NA NA 8331832 180.5 NA
1986 NA NA 9315054 184 NA
1987 NA NA 9607781 180 NA
1988 NA NA 10177311 184 NA
1989 NA NA 10649699 184 NA
1990 NA NA 11456256 186 NA
1991 NA NA 13144292 186 NA
1992 157 2.6 13634881 187 NA
1993 237 3.7 15683718 181 NA
1994 157 2.4 13612478 179 NA
1995 198 2.9 14471238 188 NA
1996 330 4.7 16393545 193 NA
1997 281 3.8 16929407 198 147
1998 342 4.5 17575700 198 144
1999 256 3.2 18852899 199 143
2000 203 2.5 19288170 213 145
2001 225 2.6 21713995 221 149
2002 204 2.3 22040439 222 145
2003 75 0.08 24224300 224 154
2004 124 1.3 25241659 222 167
2005 85 0.9 29704910 232 164
2006 124 1.3 33356709 238 161
2007 180 2.6 35416564 255 174
2008 147 1.5 38380314 251 177
2009 264 2.8 41187794 251 164
2010 276 2.9 37360500 230 157
2011 276 2.9 39852481 210 157

Year | CDCR Capacity | CDCR Population | CDCR Overpopulation Percentage | Parole Population
1980 23534 23371 0.99307385 14650 
1981 23800 26372 0.99307385 13952 
1982 24611 31319 1.07155337 16072 
1983 25703 35965 1.2184959 22202 
1984 26792 40524 1.34237832 27000 
1985 29042 45528 1.39535845 30726 
1986 32097 53620 1.67056111 34771 
1987 36465 62949 1.72628548 43355 
1988 44124 69695 1.57952588 52788 
1989 47120 79849 1.69458829 61665 
1990 51013 90405 1.77219532 73096 
1991 54042 95930 1.77510085 85470 
1992 57986 98386 1.6967199 89453 
1993 61983 109654 1.76909798 88858 
1994 66183 118968 1.79756131 92958 
1995 70717 125585 1.77588133 96110 
1996 73121 135294 1.85027557 100934 
1997 75952 146656 1.93090373 105449 
1998 79877 150731 1.88703882 111875 
1999 79873 154440 1.93356954 117612 
2000 80272 154014 1.91865158 121414 
2001 80467 153649 1.90946599 121820 
2002 79957 151579 1.89575647 117138 
2003 80187 153783 1.91780463 114136 
2004 80980 157895 1.94980242 113768 
2005 81008 158837 1.96075696 115001 
2006 87370 166547 1.90622639 121808 
2007 84653 166277 1.96421863 126906 
2008 84066 160169 1.90527681 123597 
2009 84241 154749 1.83697962 109026 
2010 84596 168830 2.00413101 108656 
2011 84130 136619 1.623903483 100490

Year | Parks and Rec Fund | Library Fund | Unemployment | Number of Vacant Units | Person Per Household
1980 NA NA NA NA NA
1981 1443405 1032802 NA NA NA
1982 1533507 1077642 NA NA NA
1983 1704186 1203748 NA NA NA
1984 1734921 1278394 NA NA NA
1985 1956253 1410963 NA NA NA
1986 2257021 1584859 NA NA NA
1987 2223182 1518322 NA NA NA
1988 2674380 1748197 NA NA NA
1989 2768221 1825048 NA NA NA
1990 2178310 1955405 9.7 1228 3.21 
1991 2587574 2059120 11.4 1227 3.24729 
1992 2785467 2180700 12.5 1230 3.33079 
1993 1992787 2121146 13.1 1234 3.36624 
1994 1966659 2119116 12.4 1240 3.46189 
1995 1542992 2191539 12.3 1251 3.45003 
1996 1850056 2263639 11.3 1269 3.47405
1997 1871278 2489376 14 1281 3.49622 
1998 2048408 2589735 10.8 1305 3.56309 
1999 2131408 2799503 9.7 1314 3.60341 
2000 2296515 2868795 7.4 1360 3.662 
2001 2817339 3020075 7.8 1370 3.69 
2002 2974138 3316832 8.9 1385 3.702 
2003 3081427 3423623 9 1400 3.7 
2004 2795909 3170427 8.3 1418 3.699 
2005 2285817 2614595 is 1433 3.654 
2006 2082617 1278414 6.9 1441 3.614 
2007 3153973 2207708 7.1 1450 3.601 
2008 3893586 3440113 8.4 1452 3.637 
2009 3976221 4061128 11.8 1463 3.643 
2010 3302147 3587431 12.8 1462 3.685 
2011 1571796 3788695 12.4 1462 4

All links verified as of 20 April 2012

Salinas population 1980:
http://www.dof.ca.gov/research/demographic/reports/estimates/e-4/1971-80/counties-cities/

Salinas population 1981-1990:
http://www.dof.ca.gov/research/demographic/reports/estimates/e-4/1981-90/

Salinas population 1990-2000:
http://www.dof.ca.gov/research/demographic/reports/estimates/e-8/

Salinas population 2000-2010:
http://www.dof.ca.gov/research/demographic/reports/estimates/e-5/2001-10/view.php

Vacant houses 1990-2000. Used total houses minus occupied houses:
http://www.dof.ca.gov/research/demographic/reports/estimates/e-8/

Vacant houses 2001-2010. Used total houses minus occupied houses:
http://www.dof.ca.gov/research/demographic/reports/estimates/e-5/2001-10/view.php

Person per household 1990-2000:
http://www.dof.ca.gov/research/demographic/reports/estimates/e-8/

Person per household 2001-2010:
http://www.dof.ca.gov/research/demographic/reports/estimates/e-5/2001-10/view.php

Prison and parolee populations 1980-2009:
http://www.cdcr.ca.gov/Reports_Research/Offender_Information_Services_Branch/Annual/CalPrisArchive.html

Prison and parolee populations 2009-2011:
http://www.cdcr.ca.gov/Reports/CDCR-Annual-Reports.html

Police, Library, and Parks and Recreation 1981-2004:
Clarke and Onufer thesis

Police, Library, and Parks and Recreation 2005-2011: Taken from column labeled “Actual.” 2011 taken from column labeled “Budget.” 2012 taken from column labeled “Adopted.”
http://www.ci.salinas.ca.us/services/finance/budget.cfm

SPD Employee and SPD Sworn Police numbers taken from Clarke and Onufer thesis.

Crime statistics 1985-2010:
http://www.ucrdatatool.gov/Search/Crime/Local/RunCrimeJurisbyJurisLarge.cfm

Crime statistics 2011:
http://www.salinaspd.com/statistics.html

School drop-out information 1991-2010. Taken from column Grade 9-12 total drop-outs. 2011-2012 not currently published and estimated to be approximately the 2010 level:
http://dq.cde.ca.gov/dataquest/DropoutReporting/DropoutsByGrade.aspx?cDistrictName=SALINAS%20UNION%20HIGH%20%20%20%20%20%20%20%20%20%20%20%20&cCountyCode=27&cDistrictCode=2766159&cSchoolCode=0000000&Level=District&TheReport=GradeOnly&ProgramName=All&cYear=2009-10&cAggSum=DTotGrade&cGender=B

Unemployment 1990-2011. 2012 estimated from current 2012 data:
http://www.calmis.ca.gov/file/lfhist/LabForce-CAMSACo.txt

APPENDIX B. DERIVED MODELS AND PREDICTIONS

There were nine different models derived in the study. The models were all fit using the data found in Appendix A. All of the models were fit in R.


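A. VIOLENCE MODELS

OLS Model:

$$\begin{aligned}
y = {} & 1.827\times10^{3} + 9.384\times10^{-6}\,(\text{SPD Budget}) \\
& - 8.354\times10^{2}\,(\text{CDCR Overpopulation Percentage}) \\
& + 50.62\,(\text{Unemployment with two year shift}) \\
& + 1.712\,(\text{Number of Vacant Units with two year shift}) \\
& - 13.93\,(\text{SPD Sworn Police with one year shift})
\end{aligned}$$

Poisson Model:

$$\begin{aligned}
y = \exp\big( & 7.679 + 8.695\times10^{-9}\,(\text{SPD Budget}) \\
& - 1.287\times10^{-2}\,(\text{SPD Sworn Police with one year shift}) \\
& - 7.743\times10^{-1}\,(\text{CDCR Overpopulation Percentage}) \\
& + 4.666\times10^{-2}\,(\text{Unemployment with two year shift}) \\
& + 1.582\times10^{-3}\,(\text{Number of Vacant Units with two year shift}) \big)
\end{aligned}$$

Negative Binomial Model: Same as Poisson Model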
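B. HOMICIDE MODELS

OLS Model:

$$y = -0.5852 + 6.130\times10^{-6}\,(\text{Library Funding})$$

Poisson Model:

$$y = \exp\left(1.522 + 4.417\times10^{-7}\,(\text{Library Funding})\right)$$

Negative Binomial Model:

$$y = \exp\left(1.520 + 4.425\times10^{-7}\,(\text{Library Funding})\right)$$

C. HOMICIDE AND ASSAULT MODELS

OLS Model:

$$y = 1330.385 - 155.163\,(\text{Unemployment with two year shift}) + 9.593\,(\text{Unemployment with two year shift})^{2}$$

Poisson Model:

$$\begin{aligned}
y = \exp\big( & 7.470 - 1.791\times10^{-6}\,(\text{Parole Population}) \\
& - 1.730\times10^{-1}\,(\text{Unemployment with two year shift}) \\
& + 1.063\times10^{-2}\,(\text{Unemployment with two year shift})^{2} \big)
\end{aligned}$$

Negative Binomial Model:

$$\begin{aligned}
y = \exp\big( & 7.448 - 1.771\times10^{-6}\,(\text{Parole Population}) \\
& - 1.687\times10^{-1}\,(\text{Unemployment with two year shift}) \\
& + 1.04\times10^{-2}\,(\text{Unemployment with two year shift})^{2} \big)
\end{aligned}$$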
D. PREDICTIONS USING MODELS

The predictions for 2011 were made with the derived models. The predictions for 2012 were made with models reformulated using the available 2011 data.

2011 | OLS | Poisson | Negative Binomial | Recorded Levels
Violence | 1789.337 | 2077.991 | 2077.991 | 1083
Homicides | 22.64 | 24.43 | 24.46 | 15
Homicides and Assaults | 881.40 | 879.80 | 879.02 | 664

2012 | OLS | Poisson | Negative Binomial | Recorded Levels
Violence | 1093.26 | 1094.10 | 1094.10 | NA
Homicides | 21.64318 | 22.92571 | 23.03067 | NA
Homicides and Assaults | 901.64993 | 916.76166 | 914.25029 | NA

APPENDIX C. R-CODE


library(MASS)  # provides glm.nb()
# setcor(), xval() and crossglm() are helper functions that are not defined in this listing
s=read.table("clipboard",header=T)

### Correlation for Violence Level
corallvio=setcor(s$Violence,s,2)
names(corallvio)=c("NoShift","OneYear","TwoYears")

###Correlation between possible regressors for Violence###
##This consists of: Population, SPD Budget,
###sworn police with a one year shift, CDCR Overpopulation Percentage,
###parole population, unemployment percentage with a two year shift,
###number of vacant units with a two year shift
###person per household with a two year shift.
cormat=data.frame(s$Population[3:31],s$SPDBudget[3:31],s$CDCRPercentage[3:31],
                  s$ParolePop[3:31],
                  s$Police[2:30],s$Unemployment[1:29],
                  s$Vacant[1:29],s$PersonPerHouse[1:29])
corbetween=cor(cormat,use="pairwise.complete.obs")

###It is cleaner to work with a new data frame used only for violence. The big dataset can be used
viodata=data.frame(s$Violence[3:31],s$SPDBudget[3:31],s$CDCRPercentage[3:31],
                   s$Police[2:30],s$Unemployment[1:29],
                   s$Vacant[1:29])
names(viodata)=c("Violence","SPDBudget","CDCRPercentage","Police","Unemployment","Vacant")


###Correlation for Homicides
corallhom=setcor(s$Homicide,s,iter=2)
colnames(corallhom)=c("NoShift","OneYear","TwoYears")
###Correlation between possible regressors for Homicide###
##This consists of: Population, SPD Budget,
###CDCR Overpopulation Percentage,
###Parks and Recreation budget
###library budget
##homdata will be used for the regression
homdata=data.frame(s$Homicide,s$Population,s$SPDBudget,s$CDCRPercentage,
                   s$ParksandRec,s$Library)
names(homdata)=c("Homicide","Population","SPDBudget","CDCRPercentage","ParksandRec","Library")
homcorbetween=cor(homdata,use="pairwise.complete.obs")

###Correlation for Assaults and Homicide
HA=s$Homicide+s$Assault
corallHA=setcor(HA,s,iter=2)
colnames(corallHA)=c("NoShift","OneYear","TwoYears")
###Correlation between possible regressors for Homicide and Assaults###
##This consists of: CDCR Overpopulation Percentage,
###parole population
###unemployment percentage with a two year shift
###number of vacant units
###person per household with a two year shift
##HAdata will be used for the regression
HA=s$Homicide+s$Assault
HAdata=data.frame(HA[3:31],s$CDCRPercentage[3:31],s$ParolePop[3:31],s$Unemployment[1:29],
                  s$Vacant[3:31],s$PersonPerHouse[1:29])
HAcorbetween=cor(HAdata,use="pairwise.complete.obs")

###Since we are not going to use Vacant Units or Person Per Household, lets put the first 2 obs back into the data set
HAdata=data.frame(HA[3:31],s$CDCRPercentage[3:31],s$ParolePop[3:31],s$Unemployment[1:29])
names(HAdata)=c("HA","CDCRPercentage","ParolePop","Unemployment")
### Regression fitting for Violence
###Linear to start
### SPD Budget, Sworn Police (1 year shift), CDCR Overpopulation Percentage
## Unemployment (2 year shift), Vacant Units (2 year shift), Person per household (2 year shift)
###first fit:
violm=lm(Violence~., data=viodata)

###To cross-validate, all of the rows with NAs must be taken out. This is the first 16 rows for this data set
crossviodata=viodata[17:29,]
violm=lm(Violence~.-PersonPerHouse,data=crossviodata)
###I like to run cross validation 10 times and take the mean
xstat=1:10
for(i in 1:10){
  xstat[i]=xval(violm)}
mean(xstat)

###Second fit:
vio2lm=lm(Violence~SPDBudget+CDCRPercentage,data=s)
xstat=1:10
for(i in 1:10){
  xstat[i]=xval(vio2lm)}
mean(xstat)

###Predictions for the graphing of the two fits:
pviolm=as.matrix(predict(violm))
pvio2lm=as.matrix(predict(vio2lm))

###Violence Poisson Regression
viopoiss=glm(Violence~.-PersonPerHouse,data=viodata,family=poisson)
crossglm(viopoiss)
pviopoiss=as.matrix(predict(viopoiss,type='response'))
###Violence NB Regression
vionb=glm.nb(Violence~.-PersonPerHouse,data=viodata,link=log,start=viopoiss$coefficients)

### Regression fitting for Homicide
###Linear to start
###first fit:
homlm=lm(Homicide~., data=homdata)
###Final Fit:
homlm=lm(Homicide~Library,data=homdata)
###Cross-validate
###I like to run cross validation 10 times and take the mean
xstat=1:10
for(i in 1:10){
  xstat[i]=xval(homlm)}
mean(xstat)

###Predictions for the graphing of the linear model:
phomlm=as.matrix(predict(homlm))

###Homicide Poisson Regression
hompoiss=glm(Homicide~.,data=homdata,family=poisson)
###First one is no good
##Here is the final model
hompoiss=glm(Homicide~Library,data=homdata,family=poisson)
crossglm(hompoiss)
phompoiss=as.matrix(predict(hompoiss,type='response'))
###Homicide NB Regression
homnb=glm.nb(Homicide~Library,data=homdata,link=log,start=hompoiss$coefficients)
phomnb=as.matrix(predict(homnb,type="response"))


### Regression fitting for Homicide and Assaults
###Linear to start
###first fit:
HAlm=lm(HA~., data=HAdata)
###Final Fit:
HAdata=HAdata[complete.cases(HAdata[,]),]
HAlm=lm(HA~Unemployment+I(Unemployment^2),data=HAdata)
###To cross-validate, all of the rows with NAs must be taken out.
###I like to run cross validation 10 times and take the mean
xstat=1:10
for(i in 1:10){
  xstat[i]=xval(HAlm)}
mean(xstat)

###Predictions for the graphing of the linear model:
pHAlm=as.matrix(predict(HAlm))

###Homicide and Assault Poisson Regression
HApoiss=glm(HA~.,data=HAdata,family=poisson)
###First one is no good
##Here is the final model
HApoiss=glm(HA~ParolePop+Unemployment+I(Unemployment^2),data=HAdata,
            family=poisson)
crossglm(HApoiss)
pHApoiss=as.matrix(predict(HApoiss,type='response'))
###Homicide and Assault NB Regression
HAnb=glm.nb(HA~ParolePop+Unemployment+I(Unemployment^2),data=HAdata,
            link=log,start=HApoiss$coefficients)
pHAnb=as.matrix(predict(HAnb,type="response"))
### This is the future predictions
s2011=read.table("clipboard",header=T)

plmvio2011=predict(violm,newdata=data.frame(SPDBudget=s2011$SPDBudget,
                   CDCRPercentage=s2011$CDCRPercentage,
                   Police=s2011$Police,Unemployment=s2011$Unemployment,Vacant=s2011$Vacant,
                   PersonPerHouse=s2011$PersonPerHouse))

pviolm2011=predict(violm,newdata=s2011)
pviopoiss2011=predict(viopoiss,newdata=s2011,type='response')
phomlm2011=predict(homlm,newdata=s2011)
phompoiss2011=predict(hompoiss,newdata=s2011,type='response')
phomnb2011=predict(homnb,newdata=s2011,type='response')
pHAlm2011=predict(HAlm,newdata=s2011)
pHApoiss2011=predict(HApoiss,newdata=s2011,type='response')
pHAnb2011=predict(HAnb,newdata=s2011,type='response')


###Here are the models with 2011 data added in and regenerated: It doesn't change the models very much
s=rbind(s,s2011)

viodata=data.frame(s$Violence[3:32],s$SPDBudget[3:32],s$CDCRPercentage[3:32],
                   s$Police[2:31],s$Unemployment[1:30],
                   s$Vacant[1:30])
names(viodata)=c("Violence","SPDBudget","CDCRPercentage","Police",
                 "Unemployment","Vacant")
viodata=viodata[complete.cases(viodata[,]),]
homdata=data.frame(s$Homicide,s$Library)
names(homdata)=c("Homicide","Library")
homdata=homdata[complete.cases(homdata[,]),]
HA=s$Homicide+s$Assault
HAdata=data.frame(HA[3:32],s$ParolePop[3:32],s$Unemployment[1:30])
names(HAdata)=c("HA","ParolePop","Unemployment")
HAdata=HAdata[complete.cases(HAdata[,]),]
violm=lm(Violence~.,data=viodata)
viopoiss=glm(Violence~.,data=viodata,family=poisson)
vionb=glm.nb(Violence~.,data=viodata,link=log,start=viopoiss$coefficients)
homlm=lm(Homicide~Library,data=homdata)
hompoiss=glm(Homicide~Library,data=homdata,family=poisson)
crossglm(hompoiss)
homnb=glm.nb(Homicide~Library,data=homdata,link=log,start=hompoiss$coefficients)
HAlm=lm(HA~Unemployment+I(Unemployment^2),data=HAdata)
HApoiss=glm(HA~ParolePop+Unemployment+I(Unemployment^2),data=HAdata,
            family=poisson)
HAnb=glm.nb(HA~ParolePop+Unemployment+I(Unemployment^2),data=HAdata,
            link=log,start=HApoiss$coefficients)


#Code for 2012 predictions. There are no 2012 actual numbers yet. Most of the 2012 data is estimated
s2012=read.table('clipboard',header=T)
pviolm2012=predict(violm,newdata=s2012)
pviopoiss2012=predict(viopoiss,newdata=s2012,type='response')
phomlm2012=predict(homlm,newdata=s2012)
phompoiss2012=predict(hompoiss,newdata=s2012,type='response')
phomnb2012=predict(homnb,newdata=s2012,type='response')
pHAlm2012=predict(HAlm,newdata=s2012)
pHApoiss2012=predict(HApoiss,newdata=s2012,type='response')
pHAnb2012=predict(HAnb,newdata=s2012,type='response')

predictions=as.matrix(c(pviolm2012,pviopoiss2012,phomlm2012,phompoiss2012,phomnb2012,
                        pHAlm2012,pHApoiss2012,pHAnb2012))
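
The listing above calls `xval()` to obtain a cross-validation score that is compared with each model's RSE, but the helper's definition is not shown. The following is a hypothetical stand-in, not the thesis's implementation: it assumes k-fold cross-validation of a linear model with the root-mean-square prediction error as the score, takes a formula and data frame rather than a fitted model, and assumes the data contain no missing values. The demonstration data are made up.

```r
# Hypothetical stand-in for the undefined xval() helper.
# Returns the root-mean-square prediction error from k-fold cross-validation
# of a linear model; assumes `data` has no missing values.
xval_sketch <- function(formula, data, k = 5) {
  n <- nrow(data)
  # Randomly assign each row to one of k folds of near-equal size
  folds <- sample(rep(seq_len(k), length.out = n))
  sq_err <- numeric(n)
  for (j in seq_len(k)) {
    held_out <- folds == j
    # Fit on the other folds, then predict the held-out rows
    fit <- lm(formula, data = data[!held_out, , drop = FALSE])
    test_rows <- data[held_out, , drop = FALSE]
    pred <- predict(fit, newdata = test_rows)
    actual <- model.response(model.frame(formula, data = test_rows))
    sq_err[held_out] <- (actual - pred)^2
  }
  sqrt(mean(sq_err))
}

# Usage on made-up data (not from the thesis)
set.seed(1)
demo <- data.frame(x = 1:20)
demo$y <- 3 + 2 * demo$x + rnorm(20)
cv_score <- xval_sketch(y ~ x, demo, k = 5)
rse <- summary(lm(y ~ x, data = demo))$sigma
print(c(cv = cv_score, rse = rse))
```

A cross-validation score close to the in-sample RSE is the pattern the thesis reads as "not overfit"; a score far above the RSE would suggest the model is fitting noise.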